[Application Note]

                   Using MMX[tm] Instructions in a Fast iDCT
                         Algorithm for MPEG Decoding

 Disclaimer
 Information in this document is provided in connection with Intel
 products. No license under any patent or copyright is granted expressly
 or implied by this publication. Intel assumes no liability whatsoever,
 including infringement of any patent or copyright, for sale and use of
 Intel products except as provided in Intel's Terms and Conditions of Sale
 for such products.

 Intel retains the right to make changes to these specifications at any
 time, without notice. Microcomputer Products may have minor variations to
 their specifications known as errata.

 MPEG is an international standard for audio and video compression and
 decompression promoted by ISO. Implementations of MPEG CODECs or MPEG
 enabled platforms may require licenses from various entities, including
 Intel Corporation. Intel makes no representation as to the need for
 licenses from any entity. No licenses, either express, implied or by
 estoppel are granted herein. For information on licensing Intel patents,
 please contact Intel.

          1.0. INTRODUCTION
             * 1.1. MPEG Compression Method

          2.0. iDCT ALGORITHM
             * 2.1. Selecting a Fast iDCT Algorithm
             * 2.2. AAN Algorithm

          3.0. AAN ALGORITHM IMPLEMENTED WITH MMX[tm] INSTRUCTIONS
             * 3.1. iDCT Routine Interface
             * 3.2. Optimization Considerations

          4.0. PERFORMANCE GAINS

          5.0. REFERENCES

          6.0. TWO-DIMENSIONAL IDCT CODE LISTING

1.0. INTRODUCTION

The Intel Architecture (IA) media extensions include single-instruction,
multiple-data (SIMD) instructions. This application note presents examples of
code that exploit these instructions. Specifically, this document describes
an implementation of a two-dimensional (2D) inverse Discrete Cosine Transform
(iDCT) using MMX[tm] instructions. This transformation is widely used in
image compression algorithms, most notably in the Joint Photographic Experts
Group (JPEG) and Moving Picture Experts Group (MPEG) standards.

This document focuses on one iDCT algorithm that provides efficient MPEG
decoding. The implementation of this algorithm in MMX code, listed in
Section 6.0, can be used "as is" according to the guidelines presented in
Section 3.1. However, many iDCT algorithms exist. The reader is encouraged
to consider the alternative ideas and issues presented in this document,
since they have implications for other iDCT algorithms.

1.1. MPEG Compression Method

The MPEG compression method has two phases, encoding and decoding.
Multimedia images are first encoded, then transmitted, and finally decoded
back into a sequence of images by the MPEG player.

The encoding process follows these steps:

  1. The input data is transformed using a Discrete Cosine Transform (DCT).
  2. The resulting DCT coefficients are quantized.
  3. The quantized coefficients are packed efficiently using Huffman tables.

During the decoding process, the MPEG player reverses these steps by
unpacking the coefficients, dequantizing them, and applying a 2D iDCT. To
achieve a high number of frames per second, this decoding process must be
very fast. This document concentrates on the iDCT component of the decoding
process.

2.0. iDCT ALGORITHM

The 8x8 two-dimensional iDCT used in MPEG decompression is defined as:

$$f(x,y) = \sum_{u=0}^{7} \sum_{v=0}^{7} \alpha(u)\,\alpha(v)\,F(u,v)\,\cos\frac{(2x+1)u\pi}{16}\,\cos\frac{(2y+1)v\pi}{16}$$

where the normalization factors, $\alpha(u)$ and $\alpha(v)$, are defined as:

$$\alpha(\omega) = \frac{1}{\sqrt{8}} = \frac{1}{2\sqrt{2}} \quad \text{for } \omega = 0$$

$$\alpha(\omega) = \frac{1}{2} \quad \text{for } \omega > 0$$

where $\omega$ is either u or v, $F(u,v)$ is the coefficient matrix, and
$f(x,y)$ is the output matrix.

Evaluating this equation directly takes 64 multiplications and 63 additions
for each element of $f(x,y)$, for a total of 4096 multiplications and 4032
additions.

The above equation is equivalent to a summation over v, followed by a
summation over u, as follows:

$$f(x,y) = \sum_{u=0}^{7} \alpha(u)\,\cos\frac{(2x+1)u\pi}{16} \left[ \sum_{v=0}^{7} \alpha(v)\,F(u,v)\,\cos\frac{(2y+1)v\pi}{16} \right]$$

This equation is equivalent to applying a one-dimensional (1D) iDCT to each
of the eight columns of $F(u,v)$ and then applying a 1D iDCT to each of the
eight rows of the result. Reversing this order by applying a 1D iDCT on the
rows first, and then on the columns, gives the same result.

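The equivalence of the double summation and the two-pass form can be checked
numerically. The following is a minimal floating-point sketch, not part of
the original application note. It assumes the normalization factor is
1/sqrt(8) for index 0 and 1/2 otherwise; the function names are hypothetical.

```python
import math

N = 8

def alpha(w):
    """Normalization factor (assumed): 1/sqrt(8) for w == 0, otherwise 1/2."""
    return 1.0 / math.sqrt(8.0) if w == 0 else 0.5

def basis(x, u):
    """alpha(u) * cos((2x + 1) * u * pi / 16)."""
    return alpha(u) * math.cos((2 * x + 1) * u * math.pi / 16.0)

def idct_2d_direct(F):
    """Double summation: 64 product terms for every output element."""
    return [[sum(basis(x, u) * basis(y, v) * F[u][v]
                 for u in range(N) for v in range(N))
             for y in range(N)]
            for x in range(N)]

def idct_1d(vec):
    """One-dimensional 8-point iDCT of a list of eight values."""
    return [sum(basis(x, u) * vec[u] for u in range(N)) for x in range(N)]

def idct_2d_separable(F):
    """Summation over v first, then over u (two passes of 1D iDCTs)."""
    # Inner summation over v: one 1D iDCT for each fixed u -> inner[u][y].
    inner = [idct_1d(F[u]) for u in range(N)]
    # Outer summation over u: one 1D iDCT for each fixed y -> outer[y][x].
    outer = [idct_1d([inner[u][y] for u in range(N)]) for y in range(N)]
    # Reorder so the result is indexed as f[x][y], like idct_2d_direct.
    return [[outer[y][x] for y in range(N)] for x in range(N)]
```

The separable form runs 16 one-dimensional transforms instead of evaluating
64 terms for each of the 64 outputs, which is the property the fast
algorithms in the following sections build on.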
2.1. Selecting a Fast iDCT Algorithm

Many algorithms have been proposed for efficient calculation of the 2D iDCT.
Some algorithms are based on efficient 1D iDCTs [4]; some rely on direct
analysis of the two-dimensional nature of iDCT [2]; and others combine a 2D
prescale with a very efficient 1D iDCT [1]. Some algorithms even take into
account the zero coefficients used in the MPEG bit streams to construct
statistically efficient iDCT algorithms [3]. Most fast 1D DCT and iDCT
algorithms are variants of Lee's Fast DCT algorithm [6], or are based on
variants of Winograd's FFT [5].

The following algorithms were evaluated for implementation using MMX
instructions:

   * Statistical algorithm [3]
   * Feig's two-dimensional algorithm [2]
   * The LLM algorithm [4]
   * The AAN algorithm [1]

In general, statistical algorithms inspect the input data and execute
conditional paths in the control flow. The overhead caused by the data
inspection and the jump instructions would be too expensive when compared to
the speed achievable by other algorithms implemented with MMX instructions.

Although Feig's 2D algorithm [2] reduces the multiplication count, the cost
of multiplication in MMX technology is small, so this reduction is not
critical. Also, the irregular memory-access pattern of this algorithm is not
conducive to efficient implementation using the MMX instruction set.

Both the LLM and AAN algorithms were implemented in MMX assembly code. The
LLM algorithm was implemented using accumulation in 32-bit elements, while
the AAN algorithm used accumulation in 16-bit elements. The resulting LLM
implementation was more accurate, conforming to the IEEE iDCT standard [7].
However, the AAN implementation was much faster. The AAN implementation is
presented in this document.

2.2. AAN Algorithm

The AAN algorithm is a one-dimensional, prescaled DCT/iDCT algorithm. First,
the eight input coefficients are prescaled by eight prescale values, which
requires eight multiplications. Then the algorithm is applied to the
prescaled coefficients, which requires five multiplications and 29 additions
for the transform. Although the 1D AAN algorithm is less efficient than
other 1D algorithms (for example, LLM), when applied to the two-dimensional,
8x8 iDCT, this algorithm takes only 64 multiplications for the prescale
(the row and column prescales combine into a single 8x8 prescale matrix),
and 80 multiplications and 464 additions for the transform (16 1D
transforms of five multiplications and 29 additions each).

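The operation counts and the combined 2D prescale can be illustrated with a
short sketch. It is not part of the original application note. It assumes the
commonly quoted AAN scale factors (1 for index 0, sqrt(2)*cos(k*pi/16)
otherwise) and 14 fractional bits. Under those assumptions the sketch appears
to reproduce the first five rows of the preSC table in Section 6.0; the
remaining rows of that table carry extra power-of-two factors that are not
modeled here. All function names are hypothetical.

```python
import math

def aan_scale(k):
    """Assumed AAN scale factor: 1 for k == 0, sqrt(2)*cos(k*pi/16) otherwise."""
    return 1.0 if k == 0 else math.sqrt(2.0) * math.cos(k * math.pi / 16.0)

def prescale_row(i, frac_bits=14):
    """Row i of the combined 2D prescale matrix in fixed point.

    Each entry is the product of a row factor and a column factor, so one
    multiplication per coefficient covers both 1D passes.
    """
    return [int(round(aan_scale(i) * aan_scale(j) * (1 << frac_bits)))
            for j in range(8)]

def aan_2d_operation_counts():
    """Multiplications and additions for the 8x8 case described in the text."""
    prescale_mults = 8 * 8          # one multiplication per coefficient
    transform_mults = 16 * 5        # 8 column passes + 8 row passes
    transform_adds = 16 * 29
    return prescale_mults + transform_mults, transform_adds
```

For comparison, the direct evaluation in Section 2.0 needs 4096
multiplications and 4032 additions, against 144 multiplications and 464
additions here.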
3.0. AAN ALGORITHM IMPLEMENTED WITH MMX[tm] INSTRUCTIONS

The AAN implementation described in this document uses 16-bit data elements,
so that four variables can be processed in parallel using packed words. MMX
instructions that operate on packed words read or store four words
contiguously in memory. So, for an 8x8 matrix, half a row can be read or
stored at one time. If the 1D iDCT is applied to the columns of an 8x8
matrix, MMX instructions can operate on four columns at a time. Applying the
1D iDCT to the rows of the matrix is more involved and less efficient.

The AAN algorithm is performed in four steps:

  1. Prescale the input coefficients.
  2. Perform an iDCT on the columns of the matrix.
  3. Transpose the matrix.
  4. Perform a second iDCT on the columns of the transposed matrix, which
     is equivalent to performing an iDCT on the rows of the original
     matrix.

These steps result in a transposed matrix, which would have to be transposed
again to obtain the final result. To avoid this extra step, the input
matrix should be transposed initially. The cost of transposing the input is
negligible, since the matrix is constructed from the Zig-Zag scan [9].
Therefore, the actual implementation follows these steps:

  1. Prescale the input coefficients of the transposed input matrix.
  2. Perform an iDCT on the columns of the transposed input matrix, which is
     equivalent to performing an iDCT on the rows of the final matrix.
  3. Transpose the matrix.
  4. Perform a second iDCT on the columns of the final matrix.

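The steps above can be sketched in floating point to show why transposing
the input removes the final transpose. This is an illustration only, not the
MMX implementation: it uses a plain 1D iDCT rather than the prescaled AAN
kernel (so step 1 is omitted), and the helper names are hypothetical. The
zig-zag order generated here is the conventional 8x8 diagonal scan.

```python
import math

N = 8

def idct_1d(vec):
    """Plain 8-point 1D iDCT (stands in for the AAN kernel)."""
    out = []
    for x in range(N):
        acc = 0.0
        for u in range(N):
            a = 1.0 / math.sqrt(8.0) if u == 0 else 0.5
            acc += a * vec[u] * math.cos((2 * x + 1) * u * math.pi / 16.0)
        out.append(acc)
    return out

def transpose(m):
    return [list(row) for row in zip(*m)]

def idct_columns(m):
    """Apply the 1D iDCT to every column of m."""
    return transpose([idct_1d(col) for col in transpose(m)])

def zigzag_positions():
    """(row, column) positions visited by the conventional zig-zag scan."""
    order = []
    for s in range(2 * N - 1):
        diag = [(r, s - r) for r in range(N) if 0 <= s - r < N]
        # Even diagonals run upward (row decreasing), odd ones downward.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order

def unpack_transposed(coeffs):
    """Write zig-zag ordered coefficients straight to transposed positions."""
    block = [[0.0] * N for _ in range(N)]
    for value, (r, c) in zip(coeffs, zigzag_positions()):
        block[c][r] = value  # swapped indices: the transpose costs nothing extra
    return block

def idct_2d_transposed_input(block_t):
    """Column pass, transpose, column pass on an already transposed input."""
    first = idct_columns(block_t)    # rows of the final matrix
    turned = transpose(first)
    return idct_columns(turned)      # columns of the final matrix
```

Because the unpacking step already writes each coefficient to a computed
position, swapping the row and column index there adds no work, which is the
sense in which the input transpose is negligible.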
Another consideration is the limited length of the MMX registers. All the
iDCT algorithms mentioned in Section 2.1 are defined mathematically without
regard to the size of the accumulator or registers. To ensure adequate
precision for operations on 16-bit data elements, the algorithm was analyzed
carefully and appropriate precision was assigned during every intermediate
stage of the calculation.

Precision was controlled using packed shift instructions, which shift all
data elements in a register by the same amount. Shift right instructions
were used to prevent overflow of the most significant bit in each
intermediate step of the calculation.

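The effect of 16-bit packed arithmetic can be modeled with a few scalar
helpers. The sketch below emulates the lane-wise behavior of the pmulhw,
paddsw and psraw instructions used in the listing in Section 6.0 on lists of
four signed 16-bit values. It is an illustration of the instruction
semantics only; the helper functions are hypothetical Python stand-ins, not
an Intel API.

```python
def to_int16(v):
    """Wrap an integer to a signed 16-bit value."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def saturate16(v):
    """Clamp an integer to the signed 16-bit range."""
    return max(-32768, min(32767, v))

def pmulhw(a, b):
    """Per lane: high 16 bits of the signed 32-bit product."""
    return [to_int16((x * y) >> 16) for x, y in zip(a, b)]

def paddsw(a, b):
    """Per lane: signed addition with saturation."""
    return [saturate16(x + y) for x, y in zip(a, b)]

def psraw(a, count):
    """Per lane: arithmetic shift right; one count applies to all lanes."""
    return [x >> count for x in a]

# Adding two large values saturates unless both are shifted right first.
big_a = [30000, 30000, 30000, 30000]
big_b = [10000, 10000, 10000, 10000]
clipped = paddsw(big_a, big_b)                     # every lane hits 32767
scaled = paddsw(psraw(big_a, 1), psraw(big_b, 1))  # half scale, no clipping
```

The shifted version keeps the sum in range at the cost of one bit of
precision in every lane, which is why the shift counts had to be chosen
stage by stage.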
3.1. iDCT Routine Interface

The iDCT routine is called from within an assembly module. The routine gets
a pointer to an 8x8, 16-bit element matrix; the pointer is located in the
ESI register. The matrix should be aligned on an 8-byte boundary. Each data
element is actually a 12-bit quantity that is left-adjusted, meaning that
the four least significant bits equal zero. The input matrix should be
transposed; otherwise, the output matrix must be transposed. The value 0.5
must be added to the DC value, input matrix element [0][0].

The routine returns the same memory array. Therefore, if the original input
operands are needed (for example, in test mode), they must be copied before
the call to the iDCT routine.

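The input conventions above can be sketched as a small preparation step.
This is an illustration under stated assumptions, not code from the
application note: the function name is hypothetical, and it assumes that,
with four fraction bits below each 12-bit value, the constant 0.5
corresponds to the integer 8 in the left-adjusted representation.

```python
def prepare_idct_input(coeffs, add_half=True):
    """Build the 64-word input array from an 8x8 block of 12-bit values.

    coeffs[r][c] holds signed 12-bit dequantized coefficients in natural
    (untransposed) order. The result is a flat, row-major list of 64
    signed 16-bit values, transposed and left-adjusted by four bits.
    """
    for row in coeffs:
        for value in row:
            if not -2048 <= value <= 2047:
                raise ValueError("coefficient does not fit in 12 bits")
    # Transpose and left-adjust: the four least significant bits become zero.
    block = [[coeffs[c][r] << 4 for c in range(8)] for r in range(8)]
    if add_half:
        # Assumption: 0.5 in a format with four fraction bits is 8.
        block[0][0] += 8
    return [value for row in block for value in row]
```

The routine overwrites its input, so a caller that still needs the original
coefficients would keep its own copy of the list before the call.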
3.2. Optimization Considerations

One standard Pentium® processor optimization technique is code rescheduling
to exploit parallelism in an algorithm. Parallelism in the 2D iDCT was
approached from two directions, as illustrated in Figure 1:

   * Within a single 8x8 iDCT
   * Between four 8x8 iDCTs

In the first approach, data is accessed by rows within the matrix. In the
second approach, data from the four matrices is interleaved to enable
efficient use of the MMX instructions.

                    Figure 1. Single iDCT vs. Four iDCTs

                                   [Image]

The advantage of the single-iDCT approach is that the interface to an MPEG
player is simpler. Its disadvantages are:

   * The matrix must be transposed in order to operate on several rows in
     parallel.
   * To prevent overflow, packed shift instructions must be used. Since in a
     given register, the accuracy of the four data elements varies, the
     shift count is determined by the worst case among the four elements.
     This results in an extra loss of accuracy.

The advantages of the four-iDCTs approach are:

   * Matrix transposition is not required.
   * To prevent overflow, packed shift instructions must still be used.
     However, since all the data elements in a register have the same
     accuracy, there is no extra loss of accuracy to accommodate the worst
     case among four elements.

The disadvantage of the four-iDCTs approach is that, to take advantage of
the MMX instructions, the input data from the four matrices must first be
interleaved (see Figure 1). Then, after the transform, the resulting data
must be restored to four 8x8 matrices.

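The interleaving required by the four-iDCTs approach can be sketched as
follows. The exact layout shown in Figure 1 is not recoverable from the
text, so this sketch assumes the simplest arrangement: each group of four
words holds the same element position from each of the four matrices. The
function names are hypothetical.

```python
def interleave4(blocks):
    """blocks: four flat lists of 64 values. Returns 64 four-word groups.

    Group i holds element i of each block, so one packed-word operation
    would process the same position of four independent iDCTs.
    """
    if len(blocks) != 4 or any(len(b) != 64 for b in blocks):
        raise ValueError("expected four blocks of 64 values")
    return [[blocks[b][i] for b in range(4)] for i in range(64)]

def deinterleave4(groups):
    """Inverse of interleave4: restore four separate 64-value blocks."""
    return [[groups[i][b] for i in range(64)] for b in range(4)]
```

With this layout every lane of a register is at the same stage of the same
calculation, which is why a single shift count suits all four elements; the
cost is the extra pass over the data before and after the transform.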
Because of its simplicity, the single-iDCT approach was chosen. Instruction
scheduling was done manually to ensure optimal performance.

Register use was carefully considered as well. In most cases, intermediate
results were kept in registers; temporary storage to memory was needed in
only a few cases. For example, consider the implementation of the matrix
transpose. The basic operation is the transpose of 4x4 elements [8], as
illustrated in Figure 2.

                    Figure 2. Matrix Transpose Operation

                                   [Image]

The 8x8 iDCT requires four of these operations. The sequence of these
operations was carefully chosen to save memory stores and loads. First, M4
was transposed, followed by M3. These two results were immediately used to
perform the iDCT on the last four columns. Similarly, M2 was transposed,
followed by M1. These results were used for the iDCT on the first four
columns.

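The 4x4 transpose can be modeled with the word and doubleword unpack
instructions that appear in the listing. The sketch below emulates
punpcklwd, punpckhwd, punpckldq and punpckhdq on lists of four words (lowest
word first) and builds an 8x8 transpose from four 4x4 blocks. It shows one
straightforward unpack sequence, not the exact hand-scheduled sequence of
Section 6.0; the Python helpers are hypothetical stand-ins for the
instructions.

```python
def punpcklwd(d, s):
    """Interleave the two low words of d and s."""
    return [d[0], s[0], d[1], s[1]]

def punpckhwd(d, s):
    """Interleave the two high words of d and s."""
    return [d[2], s[2], d[3], s[3]]

def punpckldq(d, s):
    """Low doubleword of d followed by low doubleword of s."""
    return [d[0], d[1], s[0], s[1]]

def punpckhdq(d, s):
    """High doubleword of d followed by high doubleword of s."""
    return [d[2], d[3], s[2], s[3]]

def transpose4x4(r0, r1, r2, r3):
    """Transpose four 4-word rows using eight unpack operations."""
    t0 = punpcklwd(r0, r1)
    t1 = punpckhwd(r0, r1)
    t2 = punpcklwd(r2, r3)
    t3 = punpckhwd(r2, r3)
    return [punpckldq(t0, t2), punpckhdq(t0, t2),
            punpckldq(t1, t3), punpckhdq(t1, t3)]

def transpose8x8(m):
    """Transpose an 8x8 matrix as four 4x4 blocks; M2 and M3 swap places."""
    out = [[0] * 8 for _ in range(8)]
    for br in (0, 4):
        for bc in (0, 4):
            rows = [m[br + i][bc:bc + 4] for i in range(4)]
            t = transpose4x4(*rows)
            for i in range(4):
                out[bc + i][br:br + 4] = t[i]
    return out
```

Each transposed 4x4 block fits in four registers, so a block can feed the
following column pass directly without a round trip through memory.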
The detailed steps of the implementation are:

  1. Prescale: 16 packed multiplies
  2. Column 0: even part
  3. Column 0: odd part
  4. Column 0: output butterfly
  5. Column 1: even part
  6. Column 1: odd part
  7. Column 1: output butterfly
  8. Transpose: M4 part
  9. Transpose: M3 part
 10. Column 1: even part (after transpose)
 11. Column 1: odd part (after transpose)
 12. Column 1: output butterfly (after transpose)
 13. Transpose: M2 part
 14. Transpose: M1 part
 15. Column 0: even part (after transpose)
 16. Column 0: odd part (after transpose)
 17. Column 0: output butterfly (after transpose)
 18. Cleanup

where:

   * Column 0 represents the first four columns.
   * Column 1 represents the last four columns.
   * Even part calculates the part of the iDCT that uses even-indexed
     elements.
   * Odd part calculates the part of the iDCT that uses odd-indexed
     elements.
   * Output butterfly generates the 1D iDCT using the results of the even
     and odd parts.

During the rescheduling process, instructions from one block were moved to
the previous block whenever an empty slot could be filled. This reordering
is marked by comments in the code listed in Section 6.0.

4.0. PERFORMANCE GAINS

The cycle count for this implementation of the AAN algorithm, using MMX
instructions, is 240 clocks. A direct comparison of this implementation with
a scalar implementation of the AAN algorithm would be misleading, since the
AAN algorithm is not the fastest scalar implementation of an iDCT. However,
the implementation presented here is estimated to be 3 to 3.5 times faster
than the general performance of scalar iDCT algorithms.

5.0. REFERENCES

  1. Arai, Y., T. Agui, and M. Nakajima (1988). A Fast DCT-SQ Scheme for
     Images, Trans. IEICE, 71, pp. 1095-1097.
  2. Feig, E., and S. Winograd (1992). Fast Algorithms for the Discrete
     Cosine Transform, IEEE Trans. Signal Proc., 40, pp. 2174-2193.
  3. Hung, A. C., and T. H.-Y. Meng (1994). A Fast Statistical Inverse
     Discrete Cosine Transform for Image Compression, SPIE/IS&T Electronic
     Imaging, 2187, pp. 196-205.
  4. Loeffler, C., A. Ligtenberg, and G. S. Moschytz (1989). Practical Fast
     1D DCT Algorithms with Eleven Multiplications, Proc. ICASSP 1989, pp.
     988-991.
  5. Winograd, S. (1976). On Computing the Discrete Fourier Transform, IBM
     Res. Rep., RC-6291.
  6. Lee, B. (1984). A New Algorithm to Compute the Discrete Cosine
     Transform, IEEE Trans. Signal Proc., Dec. 1984, pp. 1243-1245.
  7. IEEE Standard Specification for the Implementation of 8x8 iDCT, IEEE
     Std 1180-1990.
  8. MPEG standard, Coding of Moving Pictures, ISO/IEC DIS 11172.

6.0. TWO-DIMENSIONAL IDCT CODE LISTING

; esi - input and output data pointer
; the input data is transposed and each 16-bit element in the 8x8 matrix
; is left aligned:
; for example in 11...1110000 format
; If the iDCT is of an I macroblock then 0.5 needs to be added to the
; DC component
; (element[0][0] of the matrix)

.nolist
include iammx.inc                   ; IAMMX Emulator Macros
MMWORD  TEXTEQU
.list

.586
.model flat

_DATA SEGMENT PARA PUBLIC USE32 'DATA'
x0005000200010001       DQ 0005000200010001h
x0040000000000000       DQ 40000000000000h
x5a825a825a825a82       DW 5a82h, 5a82h, 5a82h, 5a82h  ; 23170
x539f539f539f539f       DW 539fh, 539fh, 539fh, 539fh  ; 21407
x4546454645464546       DW 4546h, 4546h, 4546h, 4546h  ; 17734
x61f861f861f861f8       DW 61f8h, 61f8h, 61f8h, 61f8h  ; 25080
scratch1        DQ 0
scratch3        DQ 0
scratch5        DQ 0
scratch7        DQ 0
; for debug only
x0      DQ 0

preSC    DW 16384, 22725, 21407, 19266,  16384, 12873, 8867,  4520
         DW 22725, 31521, 29692, 26722,  22725, 17855, 12299, 6270
         DW 21407, 29692, 27969, 25172,  21407, 16819, 11585, 5906
         DW 19266, 26722, 25172, 22654,  19266, 15137, 10426, 5315
         DW 16384, 22725, 21407, 19266,  16384, 12873, 8867,  4520
         DW 12873, 17855, 16819, 15137,  25746, 20228, 13933, 7103
         DW 17734, 24598, 23170, 20853,  17734, 13933, 9597,  4892
         DW 18081, 25080, 23624, 21261,  18081, 14206, 9785,  4988

_DATA ENDS

_TEXT SEGMENT PARA PUBLIC USE32 'CODE'

COMMENT ^
void idct8x8aan (
    int16 *src_result);
^
public  _idct8x8aan
_idct8x8aan proc near

push    ebp
lea     ecx, [preSC]

mov     ebp, esp
push    esi

mov     esi, DWORD PTR [ebp+8]          ; source
;slot

; column 0: even part
; use V4, V12, V0, V8 to produce V22..V25
movq mm0, mmword ptr [ecx+8*12]         ; maybe the first mul can be done together
                                        ; with the dequantization in iHuff module ?
;slot

pmulhw mm0, mmword ptr [esi+8*12]       ; V12
;slot

movq mm1, mmword ptr [ecx+8*4]
;slot

pmulhw mm1, mmword ptr [esi+8*4]        ; V4
;slot

movq mm3, mmword ptr [ecx+8*0]
psraw mm0, 1                            ; t64=t66

pmulhw mm3, mmword ptr [esi+8*0]        ; V0
;slot

movq mm5, mmword ptr [ecx+8*8]
movq mm2, mm1                           ; duplicate V4 (added 11/1/96)

pmulhw mm5, mmword ptr [esi+8*8]        ; V8
psubsw mm1, mm0                         ; V16

pmulhw mm1, mmword ptr x5a825a825a825a82 ; 23170 ->V18
paddsw mm2, mm0                          ; V17

movq mm0, mm2                           ; duplicate V17
psraw mm2, 1                            ; t75=t82

psraw mm0, 2                            ; t72
movq mm4, mm3                           ; duplicate V0

paddsw mm3, mm5                         ; V19
psubsw mm4, mm5                         ; V20 ;mm5 free

;moved from the block below
movq mm7, mmword ptr [ecx+8*10]
psraw mm3, 1                            ; t74=t81

movq mm6, mm3                           ; duplicate t74=t81
psraw mm4, 2                            ; t77=t79

psubsw mm1, mm0                         ; V21 ; mm0 free
paddsw mm3, mm2                         ; V22

movq mm5, mm1                           ; duplicate V21
paddsw mm1, mm4                         ; V23

movq mmword ptr [esi+8*4], mm3          ; V22
psubsw mm4, mm5                         ; V24; mm5 free

movq mmword ptr [esi+8*12], mm1         ; V23
psubsw mm6, mm2                         ; V25; mm2 free

movq mmword ptr [esi+8*0], mm4          ; V24
;slot

; keep mm6 alive all along the next block
;movq mmword ptr [esi+8*8], mm6         ; V25

; column 0: odd part
; use V2, V6, V10, V14 to produce V31, V39, V40, V41

;moved above
;movq mm7, mmword ptr [ecx+8*10]

pmulhw mm7, mmword ptr [esi+8*10]               ; V10
;slot

movq mm0, mmword ptr [ecx+8*6]
;slot

pmulhw mm0, mmword ptr [esi+8*6]                ; V6
;slot

movq mm5, mmword ptr [ecx+8*2]
movq mm3, mm7                                   ; duplicate V10

pmulhw mm5, mmword ptr [esi+8*2]                ; V2
;slot

movq mm4, mmword ptr [ecx+8*14]
psubsw mm7, mm0                                 ; V26

pmulhw mm4, mmword ptr [esi+8*14]               ; V14
paddsw mm3, mm0                                 ; V29 ; free mm0

movq mm1, mm7                                   ; duplicate V26
psraw mm3, 1                                    ; t91=t94

pmulhw mm7, mmword ptr x539f539f539f539f        ; V33
psraw mm1, 1                                    ; t96

movq mm0, mm5                                   ; duplicate V2
psraw mm4, 2                                    ; t85=t87

paddsw mm5, mm4                                 ; V27
psubsw mm0, mm4                                 ; V28 ; free mm4

movq mm2, mm0                                   ; duplicate V28
psraw mm5, 1                                    ; t90=t93

pmulhw mm0, mmword ptr x4546454645464546        ; V35
psraw mm2, 1                                    ; t97

movq mm4, mm5                                   ; duplicate t90=t93
psubsw mm1, mm2                                 ; V32 ; free mm2

pmulhw mm1, mmword ptr x61f861f861f861f8        ; V36
psllw mm7, 1                                    ; t107

paddsw mm5, mm3                                 ; V31
psubsw mm4, mm3                                 ; V30 ; free mm3

pmulhw mm4, mmword ptr x5a825a825a825a82        ; V34
nop ;slot

psubsw mm0, mm1                                 ; V38
psubsw mm1, mm7                                 ; V37 ; free mm7

psllw mm1, 1                                    ; t114
;move from the next block
movq mm3, mm6                                   ; duplicate V25

;move from the next block
movq mm7, mmword ptr [esi+8*4]                  ; V22
psllw mm0, 1                                    ; t110

psubsw mm0, mm5                                 ; V39 (mm5 still needed for next block)
psllw mm4, 2                                    ; t112

;move from the next block
movq mm2, mmword ptr [esi+8*12]                 ; V23
psubsw mm4, mm0                                 ; V40

paddsw mm1, mm4                                 ; V41; free mm0
;move from the next block
psllw mm2, 1                                    ; t117=t125

; column 0: output butterfly
;move above
;movq mm3, mm6                                  ; duplicate V25
;movq mm7, mmword ptr [esi+8*4]                 ; V22
;movq mm2, mmword ptr [esi+8*12]                ; V23
;psllw mm2, 1                                   ; t117=t125

psubsw mm6, mm1                                 ; tm6
paddsw mm3, mm1                                 ; tm8; free mm1

movq mm1, mm7                                   ; duplicate V22
paddsw mm7, mm5                                 ; tm0

movq mmword ptr [esi+8*8], mm3                  ; tm8; free mm3
psubsw mm1, mm5                                 ; tm14; free mm5

movq mmword ptr [esi+8*6], mm6                  ; tm6; free mm6
movq mm3, mm2                                   ; duplicate t117=t125

movq mm6, mmword ptr [esi+8*0]                  ; V24
paddsw mm2, mm0                                 ; tm2

movq mmword ptr [esi+8*0], mm7                  ; tm0; free mm7
psubsw mm3, mm0                                 ; tm12; free mm0

movq mmword ptr [esi+8*14], mm1                 ; tm14; free mm1
psllw mm6, 1                                    ; t119=t123

movq mmword ptr [esi+8*2], mm2                  ; tm2; free mm2
movq mm0, mm6                                   ; duplicate t119=t123

movq mmword ptr [esi+8*12], mm3                 ; tm12; free mm3
paddsw mm6, mm4                                 ; tm4

;moved from next block
movq mm1, mmword ptr [ecx+8*5]
psubsw mm0, mm4                                 ; tm10; free mm4

;moved from next block
pmulhw mm1, mmword ptr [esi+8*5]                ; V5
;slot

movq mmword ptr [esi+8*4], mm6                  ; tm4; free mm6
;slot

movq mmword ptr [esi+8*10], mm0                 ; tm10; free mm0
;slot

; column 1: even part
; use V5, V13, V1, V9 to produce V56..V59
;moved to prev block
;movq mm1, mmword ptr [ecx+8*5]
;pmulhw mm1, mmword ptr [esi+8*5]               ; V5

movq mm7, mmword ptr [ecx+8*13]
psllw mm1, 1                                    ; t128=t130

pmulhw mm7, mmword ptr [esi+8*13]               ; V13
movq mm2, mm1                                   ; duplicate t128=t130

movq mm3, mmword ptr [ecx+8*1]
;slot

pmulhw mm3, mmword ptr [esi+8*1]                ; V1
;slot

movq mm5, mmword ptr [ecx+8*9]
psubsw mm1, mm7                                 ; V50

pmulhw mm5, mmword ptr [esi+8*9]                ; V9
paddsw mm2, mm7                                 ; V51

pmulhw mm1, mmword ptr x5a825a825a825a82        ; 23170 ->V52
movq mm6, mm2                                   ; duplicate V51

psraw mm2, 1                                    ; t138=t144
movq mm4, mm3                                   ; duplicate V1

psraw mm6, 2                                    ; t136
paddsw mm3, mm5                                 ; V53

psubsw mm4, mm5                                 ; V54 ;mm5 free
movq mm7, mm3                                   ; duplicate V53

;moved from next block
movq mm0, mmword ptr [ecx+8*11]
psraw mm4, 1                                    ; t140=t142

psubsw mm1, mm6                                 ; V55 ; mm6 free
paddsw mm3, mm2                                 ; V56

movq mm5, mm4                                   ; duplicate t140=t142
paddsw mm4, mm1                                 ; V57

movq mmword ptr [esi+8*5], mm3                  ; V56
psubsw mm5, mm1                                 ; V58; mm1 free

movq mmword ptr [esi+8*13], mm4                 ; V57
psubsw mm7, mm2                                 ; V59; mm2 free

movq mmword ptr [esi+8*9], mm5                  ; V58
;slot

; keep mm7 alive all along the next block
;movq mmword ptr [esi+8*1], mm7                 ; V59

; column 1: odd part
; use V3, V7, V11, V15 to produce V65, V73, V74, V75

;moved above
;movq mm0, mmword ptr [ecx+8*11]

pmulhw mm0, mmword ptr [esi+8*11]               ; V11
;slot

movq mm6, mmword ptr [ecx+8*7]
;slot

pmulhw mm6, mmword ptr [esi+8*7]                ; V7
;slot

movq mm4, mmword ptr [ecx+8*15]
movq mm3, mm0                                   ; duplicate V11

pmulhw mm4, mmword ptr [esi+8*15]               ; V15
;slot

movq mm5, mmword ptr [ecx+8*3]
psllw mm6, 1                                    ; t146=t152

pmulhw mm5, mmword ptr [esi+8*3]                ; V3
paddsw mm0, mm6                                 ; V63

; note that V15 computation has a correction step:
; this is a 'magic' constant that rebiases the results to be closer to the expected result
; this magic constant can be refined to reduce the error even more
; by doing the correction step in a later stage when the number is actually multiplied by 16

paddw mm4, mmword ptr x0005000200010001
psubsw mm3, mm6                                 ; V60 ; free mm6

psraw mm0, 1                                    ; t154=t156
movq mm1, mm3                                   ; duplicate V60

pmulhw mm1, mmword ptr x539f539f539f539f        ; V67
movq mm6, mm5                                   ; duplicate V3

psraw mm4, 2                                    ; t148=t150
;slot

paddsw mm5, mm4                                 ; V61
psubsw mm6, mm4                                 ; V62 ; free mm4

movq mm4, mm5                                   ; duplicate V61
psllw mm1, 1                                    ; t169

paddsw mm5, mm0                                 ; V65 -> result
psubsw mm4, mm0                                 ; V64 ; free mm0

pmulhw mm4, mmword ptr x5a825a825a825a82        ; V68
psraw mm3, 1                                    ; t158

psubsw mm3, mm6                                 ; V66
movq mm2, mm5                                   ; duplicate V65

pmulhw mm3, mmword ptr x61f861f861f861f8        ; V70
psllw mm6, 1                                    ; t165

pmulhw mm6, mmword ptr x4546454645464546        ; V69
psraw mm2, 1                                    ; t172

;moved from next block
movq mm0, mmword ptr [esi+8*5]                  ; V56
psllw mm4, 1                                    ; t174

;moved from next block
psraw mm0, 1                                    ; t177=t188
nop ; slot

psubsw mm6, mm3                                 ; V72
psubsw mm3, mm1                                 ; V71 ; free mm1

psubsw mm6, mm2                                 ; V73 ; free mm2
;moved from next block
psraw mm5, 1                                    ; t178=t189

psubsw mm4, mm6                                 ; V74
;moved from next block
movq mm1, mm0                                   ; duplicate t177=t188

paddsw mm3, mm4                                 ; V75
;moved from next block
paddsw mm0, mm5                                 ; tm1

; column 1: output butterfly
;location
;  5 - V56
; 13 - V57
;  9 - V58
;  X - V59, mm7
;  X - V65, mm5
;  X - V73, mm6
;  X - V74, mm4
;  X - V75, mm3
; free mm0, mm1 & mm2
;move above
;movq mm0, mmword ptr [esi+8*5]                 ; V56
;psllw mm0, 1                                   ; t177=t188 ! new !!
;psllw mm5, 1                                   ; t178=t189 ! new !!
;movq mm1, mm0                                  ; duplicate t177=t188
;paddsw mm0, mm5                                ; tm1

movq mm2, mmword ptr [esi+8*13]                 ; V57
psubsw mm1, mm5                                 ; tm15; free mm5

movq mmword ptr [esi+8*1], mm0                  ; tm1; free mm0
psraw mm7, 1                                    ; t182=t184 ! new !!

;save the store as used directly in the transpose
;movq mmword ptr [esi+8*15], mm1                ; tm15; free mm1
movq mm5, mm7                                   ; duplicate t182=t184
psubsw mm7, mm3                                 ; tm7

paddsw mm5, mm3                                 ; tm9; free mm3
;slot

movq mm0, mmword ptr [esi+8*9]                  ; V58
movq mm3, mm2                                   ; duplicate V57

movq mmword ptr [esi+8*7], mm7                  ; tm7; free mm7
psubsw mm3, mm6                                 ; tm13

paddsw mm2, mm6                                 ; tm3 ; free mm6
; moved up from the transpose
movq mm7, mm3

; moved up from the transpose
punpcklwd mm3, mm1
movq mm6, mm0                                   ; duplicate V58

movq mmword ptr [esi+8*3], mm2                  ; tm3; free mm2
paddsw mm0, mm4                                 ; tm5

psubsw mm6, mm4                                 ; tm11; free mm4
; moved up from the transpose
punpckhwd mm7, mm1

movq mmword ptr [esi+8*5], mm0                  ; tm5; free mm0
; moved up from the transpose
movq mm2, mm5

; transpose - M4 part
;  ---------       ---------
; | M1 | M2 |     | M1'| M3'|
;  ---------  -->  ---------
; | M3 | M4 |     | M2'| M4'|
;  ---------       ---------
; Two alternatives: use full mmword approach so the following code can be
; scheduled before the transpose is done without stores, or use the faster
; half mmword stores (when possible)

movdf dword ptr [esi+8*9+4], mm3                ; MS part of tmt9
punpcklwd mm5, mm6

movdf dword ptr [esi+8*13+4], mm7               ; MS part of tmt13
punpckhwd mm2, mm6

movdf dword ptr [esi+8*9], mm5                  ; LS part of tmt9
punpckhdq mm5, mm3                              ; free mm3

movdf dword ptr [esi+8*13], mm2                 ; LS part of tmt13
punpckhdq mm2, mm7                              ; free mm7

; moved up from the M3 transpose
movq mm0, mmword ptr [esi+8*8]
;slot

; moved up from the M3 transpose
movq mm1, mmword ptr [esi+8*10]
; moved up from the M3 transpose
movq mm3, mm0

; shuffle the rest of the data, and write it with 2 mmword writes
movq mmword ptr [esi+8*11], mm5                 ; tmt11
; moved up from the M3 transpose
punpcklwd mm0, mm1

movq mmword ptr [esi+8*15], mm2                 ; tmt15
; moved up from the M3 transpose
punpckhwd mm3, mm1
 
; transpose - M3 part

; moved up to previous code section
;movq mm0, mmword ptr [esi+8*8]
;movq mm1, mmword ptr [esi+8*10]
;movq mm3, mm0
;punpcklwd mm0, mm1
;punpckhwd mm3, mm1

movq mm6, mmword ptr [esi+8*12]
;slot

movq mm4, mmword ptr [esi+8*14]
movq mm2, mm6

; shuffle the data and write out the lower parts of the transposed in 4 dwords
punpcklwd mm6, mm4
movq mm1, mm0

punpckhdq mm1, mm6
movq mm7, mm3

punpckhwd mm2, mm4                              ; free mm4
;slot

punpckldq mm0, mm6                              ; free mm6
;slot

;moved from next block
movq mm4, mmword ptr [esi+8*13]                 ; tmt13
punpckldq mm3, mm2

punpckhdq mm7, mm2                              ; free mm2
;moved from next block
movq mm5, mm3                                   ; duplicate tmt5
 
; column 1: even part (after transpose)

;moved above
;movq mm5, mm3                                  ; duplicate tmt5
;movq mm4, mmword ptr [esi+8*13]                ; tmt13

psubsw mm3, mm4                                 ; V134
;slot

pmulhw mm3, mmword ptr x5a825a825a825a82        ; 23170 ->V136
;slot

movq mm6, mmword ptr [esi+8*9]                  ; tmt9
paddsw mm5, mm4                                 ; V135 ; mm4 free

movq mm4, mm0                                   ; duplicate tmt1
paddsw mm0, mm6                                 ; V137

psubsw mm4, mm6                                 ; V138 ; mm6 free
psllw mm3, 2                                    ; t290
psubsw mm3, mm5                                 ; V139
movq mm6, mm0                                   ; duplicate V137

paddsw mm0, mm5                                 ; V140
movq mm2, mm4                                   ; duplicate V138

paddsw mm2, mm3                                 ; V141
psubsw mm4, mm3                                 ; V142 ; mm3 free

movq mmword ptr [esi+8*9], mm0                  ; V140
psubsw mm6, mm5                                 ; V143 ; mm5 free

;moved from next block
movq mm0, mmword ptr[esi+8*11]                  ; tmt11
;slot

movq mmword ptr [esi+8*13], mm2                 ; V141
;moved from next block
movq mm2, mm0                                   ; duplicate tmt11
 
; column 1: odd part (after transpose)

;moved up to the prev block
;movq mm0, mmword ptr[esi+8*11]                 ; tmt11
;movq mm2, mm0                                  ; duplicate tmt11

movq mm5, mmword ptr[esi+8*15]                  ; tmt15
psubsw mm0, mm7                                 ; V144

movq mm3, mm0                                   ; duplicate V144
paddsw mm2, mm7                                 ; V147 ; free mm7

pmulhw mm0, mmword ptr x539f539f539f539f        ; 21407-> V151
movq mm7, mm1                                   ; duplicate tmt3

paddsw mm7, mm5                                 ; V145
psubsw mm1, mm5                                 ; V146 ; free mm5

psubsw mm3, mm1                                 ; V150
movq mm5, mm7                                   ; duplicate V145

pmulhw mm1, mmword ptr x4546454645464546        ; 17734-> V153
psubsw mm5, mm2                                 ; V148

pmulhw mm3, mmword ptr x61f861f861f861f8        ; 25080-> V154
psllw mm0, 2                                    ; t311

pmulhw mm5, mmword ptr x5a825a825a825a82        ; 23170-> V152
paddsw mm7, mm2                                 ; V149 ; free mm2

psllw mm1, 1                                    ; t313
nop ; slot

;without the nop above, there is a one-clock freeze here
;the nop cleans the mess a little bit
movq mm2, mm3                                   ; duplicate V154
psubsw mm3, mm0                                 ; V155 ; free mm0

psubsw mm1, mm2                                 ; V156 ; free mm2
;moved from the next block
movq mm2, mm6                                   ; duplicate V143

;moved from the next block
movq mm0, mmword ptr[esi+8*13]                  ; V141
psllw mm1, 1                                    ; t315

psubsw mm1, mm7                                 ; V157 (keep V149)
psllw mm5, 2                                    ; t317

psubsw mm5, mm1                                 ; V158
psllw mm3, 1                                    ; t319

paddsw mm3, mm5                                 ; V159
;slot
 
; column 1: output butterfly (after transform)
;moved to the prev block
;movq mm2, mm6                                  ; duplicate V143
;movq mm0, mmword ptr[esi+8*13]                 ; V141

psubsw mm2, mm3                                 ; V163
paddsw mm6, mm3                                 ; V164 ; free mm3

movq mm3, mm4                                   ; duplicate V142
psubsw mm4, mm5                                 ; V165 ; free mm5

movq mmword ptr scratch7, mm2                   ; out7
psraw mm6, 4

psraw mm4, 4
paddsw mm3, mm5                                 ; V162

movq mm2, mmword ptr[esi+8*9]                   ; V140
movq mm5, mm0                                   ; duplicate V141

;in order not to percolate this line up, we read [esi+8*9] very near to this location
movq mmword ptr [esi+8*9], mm6                  ; out9
paddsw mm0, mm1                                 ; V161

movq mmword ptr scratch5, mm3                   ; out5
psubsw mm5, mm1                                 ; V166 ; free mm1

movq mmword ptr[esi+8*11], mm4                  ; out11
psraw mm5, 4

movq mmword ptr scratch3, mm0                   ; out3
movq mm4, mm2                                   ; duplicate V140

movq mmword ptr[esi+8*13], mm5                  ; out13
paddsw mm2, mm7                                 ; V160

;moved from the next block
movq mm0, mmword ptr [esi+8*1]
psubsw mm4, mm7                                 ; V167 ; free mm7

;moved from the next block
movq mm7, mmword ptr [esi+8*3]
psraw mm4, 4

movq mmword ptr scratch1, mm2                   ; out1
;moved from the next block
movq mm1, mm0

movq mmword ptr[esi+8*15], mm4                  ; out15
;moved from the next block
punpcklwd mm0, mm7
 
; transpose - M2 parts
;moved up to the prev block
;movq mm0, mmword ptr [esi+8*1]
;movq mm7, mmword ptr [esi+8*3]
;movq mm1, mm0
;punpcklwd mm0, mm7

movq mm5, mmword ptr [esi+8*5]
punpckhwd mm1, mm7

movq mm4, mmword ptr [esi+8*7]
movq mm3, mm5

; shuffle the data and write out the lower parts of the transposed in 4 dwords
movdf dword ptr [esi+8*8], mm0                  ; LS part of tmt8
punpcklwd mm5, mm4

movdf dword ptr [esi+8*12], mm1                 ; LS part of tmt12
punpckhwd mm3, mm4

movdf dword ptr [esi+8*8+4], mm5                ; MS part of tmt8
punpckhdq mm0, mm5                              ; tmt10

movdf dword ptr [esi+8*12+4], mm3               ; MS part of tmt12
punpckhdq mm1, mm3                              ; tmt14
 
; transpose - M1 parts
movq mm7, mmword ptr [esi]
;slot

movq mm2, mmword ptr [esi+8*2]
movq mm6, mm7

movq mm5, mmword ptr [esi+8*4]
punpcklwd mm7, mm2

movq mm4, mmword ptr [esi+8*6]
punpckhwd mm6, mm2                              ; free mm2

movq mm3, mm5
punpcklwd mm5, mm4

punpckhwd mm3, mm4                              ; free mm4
movq mm2, mm7

movq mm4, mm6
punpckldq mm7, mm5                              ; tmt0

punpckhdq mm2, mm5                              ; tmt2 ; free mm5
;slot

; shuffle the rest of the data, and write it with 2 mmword writes
punpckldq mm6, mm3                              ; tmt4
;moved from next block
movq mm5, mm2                                   ; duplicate tmt2

punpckhdq mm4, mm3                              ; tmt6 ; free mm3
;moved from next block
movq mm3, mm0                                   ; duplicate tmt10
 
; column 0: odd part (after transpose)
;moved up to prev block
;movq mm3, mm0                                  ; duplicate tmt10
;movq mm5, mm2                                  ; duplicate tmt2

psubsw mm0, mm4                                 ; V110
paddsw mm3, mm4                                 ; V113 ; free mm4

movq mm4, mm0                                   ; duplicate V110
paddsw mm2, mm1                                 ; V111

pmulhw mm0, mmword ptr x539f539f539f539f        ; 21407-> V117
psubsw mm5, mm1                                 ; V112 ; free mm1

psubsw mm4, mm5                                 ; V116
movq mm1, mm2                                   ; duplicate V111

pmulhw mm5, mmword ptr x4546454645464546        ; 17734-> V119
psubsw mm2, mm3                                 ; V114

pmulhw mm4, mmword ptr x61f861f861f861f8        ; 25080-> V120
paddsw mm1, mm3                                 ; V115 ; free mm3

pmulhw mm2, mmword ptr x5a825a825a825a82        ; 23170-> V118
psllw mm0, 2                                    ; t266

movq mmword ptr[esi+8*0], mm1                   ; save V115
psllw mm5, 1                                    ; t268

psubsw mm5, mm4                                 ; V122
psubsw mm4, mm0                                 ; V121 ; free mm0

psllw mm5, 1                                    ; t270
;slot

psubsw mm5, mm1                                 ; V123 ; free mm1
psllw mm2, 2                                    ; t272

psubsw mm2, mm5                                 ; V124 (keep V123)
psllw mm4, 1                                    ; t274

movq mmword ptr[esi+8*2], mm5                   ; save V123 ; free mm5
paddsw mm4, mm2                                 ; V125 (keep V124)
 
; column 0: even part (after transpose)
movq mm0, mmword ptr[esi+8*12]                  ; tmt12
movq mm3, mm6                                   ; duplicate tmt4

psubsw mm6, mm0                                 ; V100
paddsw mm3, mm0                                 ; V101 ; free mm0

pmulhw mm6, mmword ptr x5a825a825a825a82        ; 23170 ->V102
movq mm5, mm7                                   ; duplicate tmt0

movq mm1, mmword ptr[esi+8*8]                   ; tmt8
;slot

paddsw mm7, mm1                                 ; V103
psubsw mm5, mm1                                 ; V104 ; free mm1

movq mm0, mm7                                   ; duplicate V103
psllw mm6, 2                                    ; t245

paddsw mm7, mm3                                 ; V106
movq mm1, mm5                                   ; duplicate V104

psubsw mm6, mm3                                 ; V105
psubsw mm0, mm3                                 ; V109; free mm3

paddsw mm5, mm6                                 ; V107
psubsw mm1, mm6                                 ; V108 ; free mm6
 
; column 0: output butterfly (after transform)
movq mm3, mm1                                   ; duplicate V108
paddsw mm1, mm2                                 ; out4

psraw mm1, 4
psubsw mm3, mm2                                 ; out10 ; free mm2

psraw mm3, 4
movq mm6, mm0                                   ; duplicate V109

movq mmword ptr[esi+8*4], mm1                   ; out4 ; free mm1
psubsw mm0, mm4                                 ; out6

movq mmword ptr[esi+8*10], mm3                  ; out10 ; free mm3
psraw mm0, 4

paddsw mm6, mm4                                 ; out8 ; free mm4
movq mm1, mm7                                   ; duplicate V106

movq mmword ptr[esi+8*6], mm0                   ; out6 ; free mm0
psraw mm6, 4

movq mm4, mmword ptr[esi+8*0]                   ; V115
;slot

movq mmword ptr[esi+8*8], mm6                   ; out8 ; free mm6
movq mm2, mm5                                   ; duplicate V107

movq mm3, mmword ptr[esi+8*2]                   ; V123
paddsw mm7, mm4                                 ; out0

;moved up from next block
movq mm0, mmword ptr scratch3
psraw mm7, 4

;moved up from next block
movq mm6, mmword ptr scratch5
psubsw mm1, mm4                                 ; out14 ; free mm4

paddsw mm5, mm3                                 ; out2
psraw mm1, 4

movq mmword ptr[esi], mm7                       ; out0 ; free mm7
psraw mm5, 4

movq mmword ptr[esi+8*14], mm1                  ; out14 ; free mm1
psubsw mm2, mm3                                 ; out12 ; free mm3

movq mmword ptr[esi+8*2], mm5                   ; out2 ; free mm5
psraw mm2, 4

;moved up from next block
movq mm4, mmword ptr scratch7
;moved up from next block
psraw mm0, 4

movq mmword ptr[esi+8*12], mm2                  ; out12 ; free mm2
;moved up from next block
psraw mm6, 4
 
;move back the data to its correct place
;moved up to the prev block
;movq mm0, mmword ptr scratch3
;movq mm6, mmword ptr scratch5
;movq mm4, mmword ptr scratch7
;psraw mm0, 4
;psraw mm6, 4

movq mm1, mmword ptr scratch1
psraw mm4, 4

movq mmword ptr [esi+8*3], mm0                  ; out3
psraw mm1, 4

movq mmword ptr [esi+8*5], mm6                  ; out5
;slot

movq mmword ptr [esi+8*7], mm4                  ; out7
;slot

movq mmword ptr [esi+8*1], mm1                  ; out1
;slot

emms

pop esi
pop ebp

ret     0

_idct8x8aan ENDP
_TEXT ENDS
 
END
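
For readers checking the listing by hand, two idioms it relies on can be modelled in portable C. The sketch below is not part of the original routine: mm_t, pmulhw_lane and transpose4x4 are hypothetical names introduced here, and the model assumes w[0] is the least significant word of an MMX register. It shows (a) one lane of pmulhw, which keeps the high 16 bits of a signed 16x16 product, so that a constant such as 23170 (about 2^15 / sqrt(2)) followed by psllw 2 approximates a multiply by 1.414, and (b) the unpack sequence that the "transpose - M1 parts" block uses to transpose a 4x4 block of words.

```c
#include <stdint.h>

/* One 16-bit lane of pmulhw: signed multiply, keep floor(a*b / 65536).
   Written without right-shifting negative values, whose result is
   implementation-defined in C. */
static int16_t pmulhw_lane(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    if (p < 0)
        p -= 65535;               /* make truncating division act as floor */
    return (int16_t)(p / 65536);
}

/* Hypothetical model of a 64-bit MMX register as four words,
   w[0] being the least significant word (an assumption of this sketch). */
typedef struct { int16_t w[4]; } mm_t;

/* Word and dword interleaves, destination operand first as in the listing. */
static mm_t punpcklwd(mm_t d, mm_t s) { mm_t r = {{ d.w[0], s.w[0], d.w[1], s.w[1] }}; return r; }
static mm_t punpckhwd(mm_t d, mm_t s) { mm_t r = {{ d.w[2], s.w[2], d.w[3], s.w[3] }}; return r; }
static mm_t punpckldq(mm_t d, mm_t s) { mm_t r = {{ d.w[0], d.w[1], s.w[0], s.w[1] }}; return r; }
static mm_t punpckhdq(mm_t d, mm_t s) { mm_t r = {{ d.w[2], d.w[3], s.w[2], s.w[3] }}; return r; }

/* Transpose a 4x4 block of words held in four registers (rows a, b, c, d). */
static void transpose4x4(const mm_t in[4], mm_t out[4])
{
    mm_t lo01 = punpcklwd(in[0], in[1]);   /* a0 b0 a1 b1 */
    mm_t hi01 = punpckhwd(in[0], in[1]);   /* a2 b2 a3 b3 */
    mm_t lo23 = punpcklwd(in[2], in[3]);   /* c0 d0 c1 d1 */
    mm_t hi23 = punpckhwd(in[2], in[3]);   /* c2 d2 c3 d3 */
    out[0] = punpckldq(lo01, lo23);        /* a0 b0 c0 d0 */
    out[1] = punpckhdq(lo01, lo23);        /* a1 b1 c1 d1 */
    out[2] = punpckldq(hi01, hi23);        /* a2 b2 c2 d2 */
    out[3] = punpckhdq(hi01, hi23);        /* a3 b3 c3 d3 */
}
```

With a = 1000, pmulhw_lane(a, 23170) yields 353, and shifting that left by 2 gives 1412, close to 1000 * 1.4142. The shortfall comes from the bits discarded by the high-word multiply.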
 
* Legal Information © 1998 Intel Corporation
