H.264: split luma dc idct out and implement MMX/SSE2 versions. About 2.5x the speed.
NOTE: the way that the asm code handles large qmuls is a bit suboptimal. If x264-style dequant was used (separate shift and qmul values), it might be possible to get some extra speed....
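For orientation, a scalar sketch of what a luma DC dequant + IDCT routine computes: a 4x4 inverse Hadamard transform followed by a multiply, round and shift. All names here are hypothetical, and the `(v * qmul + 128) >> 8` form is an assumption about the combined-multiplier style, not a copy of the committed code. `dequant_split_sketch()` only illustrates one possible reading of "separate shift and qmul values"; it is not x264's implementation.

```c
#include <stdint.h>

/* Assumed combined form: one multiplier, rounded, shifted right by 8.
 * Right-shifting a negative int is implementation-defined in C; this
 * sketch assumes the usual arithmetic shift. */
static int dequant_combined_sketch(int v, int qmul)
{
    return (v * qmul + 128) >> 8;
}

/* Hypothetical split form: a small multiplier plus a separate left shift,
 * written as a multiplication to avoid shifting negative values. */
static int dequant_split_sketch(int v, int mf, int shift)
{
    return v * mf * (1 << shift);
}

/* 4x4 inverse Hadamard over the 16 luma DC coefficients (row-major),
 * then dequantization of every result. */
static void luma_dc_dequant_idct_sketch(int out[16], const int16_t in[16], int qmul)
{
    int tmp[16];

    /* Pass 1: butterfly each row. */
    for (int i = 0; i < 4; i++) {
        const int s01 = in[4 * i + 0] + in[4 * i + 1];
        const int d01 = in[4 * i + 0] - in[4 * i + 1];
        const int s23 = in[4 * i + 2] + in[4 * i + 3];
        const int d23 = in[4 * i + 2] - in[4 * i + 3];
        tmp[4 * i + 0] = s01 + s23;
        tmp[4 * i + 1] = s01 - s23;
        tmp[4 * i + 2] = d01 - d23;
        tmp[4 * i + 3] = d01 + d23;
    }

    /* Pass 2: butterfly each column, then dequantize. */
    for (int i = 0; i < 4; i++) {
        const int s01 = tmp[i]     + tmp[4 + i];
        const int d01 = tmp[i]     - tmp[4 + i];
        const int s23 = tmp[8 + i] + tmp[12 + i];
        const int d23 = tmp[8 + i] - tmp[12 + i];
        out[i]      = dequant_combined_sketch(s01 + s23, qmul);
        out[4 + i]  = dequant_combined_sketch(s01 - s23, qmul);
        out[8 + i]  = dequant_combined_sketch(d01 - d23, qmul);
        out[12 + i] = dequant_combined_sketch(d01 + d23, qmul);
    }
}
```

When the combined multiplier factors as `mf << (shift + 8)`, both helpers return the same value, so the split form can keep the multiplier small.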
Add d suffix to movd target register to make it work with nasm.
Originally committed as revision 25206 to svn://svn.ffmpeg.org/ffmpeg/trunk
Unroll loop in h264_idct_add16intra_sse2(). Basically identical to r25171, this inlines scan8 and removes loop setup. 15% faster, 0.4% overall.
See "[PATCH] unroll loop in h264_idct_add8_sse2()" thread on ML.
Originally committed as revision 25172 to svn://svn.ffmpeg.org/ffmpeg/trunk
Unroll loop in h264_idct_add8_sse2(). This means we can also inline scan8 directly in the code and remove loop setup. 20% faster in function, 0.8% overall.
Originally committed as revision 25171 to svn://svn.ffmpeg.org/ffmpeg/trunk
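The unrolling described in r25171 and r25172 can be pictured in C roughly as follows. This is a simplified sketch, not the committed asm: the table `scan8_sketch`, the stand-in `add_block_sketch()` and both function names are hypothetical, and only four blocks are shown. The point is that once the loop is unrolled, each table lookup becomes a constant offset and the loop counter disappears.

```c
#include <stdint.h>

/* Illustrative lookup table mapping block index to a position in the
 * non-zero-count cache (values assumed for the sketch). */
static const uint8_t scan8_sketch[4] = {
    4 + 1 * 8, 5 + 1 * 8, 4 + 2 * 8, 5 + 2 * 8
};

/* Stand-in for the per-block IDCT-and-add: adds the block's DC value
 * to a single destination byte. */
static void add_block_sketch(uint8_t *dst, const int16_t *block)
{
    dst[0] = (uint8_t)(dst[0] + block[0]);
}

/* Looped form: table lookup and loop bookkeeping on every iteration. */
static void add4_looped_sketch(uint8_t *dst, const int16_t *blocks,
                               const uint8_t *nnzc)
{
    for (int i = 0; i < 4; i++)
        if (nnzc[scan8_sketch[i]])
            add_block_sketch(dst + i, blocks + i * 16);
}

/* Unrolled form: the scan8 values are inlined as constants and there is
 * no loop setup left. */
static void add4_unrolled_sketch(uint8_t *dst, const int16_t *blocks,
                                 const uint8_t *nnzc)
{
    if (nnzc[4 + 1 * 8]) add_block_sketch(dst + 0, blocks + 0 * 16);
    if (nnzc[5 + 1 * 8]) add_block_sketch(dst + 1, blocks + 1 * 16);
    if (nnzc[4 + 2 * 8]) add_block_sketch(dst + 2, blocks + 2 * 16);
    if (nnzc[5 + 2 * 8]) add_block_sketch(dst + 3, blocks + 3 * 16);
}
```

Both forms skip blocks whose non-zero count is 0 and produce the same output; the unrolled one just trades code size for fewer instructions per block.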
Rename h264_idct_sse2.asm to h264_idct.asm; move inline IDCT asm from h264dsp_mmx.c to h264_idct.asm (as yasm code). Because the loops are now coded in asm instead of C, this is (depending on the function) up to 50% faster for cases where gcc didn't do a great job at looping....