ffmpeg / libavcodec / x86 @ ae112918

# Date Author Comment
ae112918 09/24/2010 02:07 PM Ronald S. Bultje

Unroll loop in h264_idct_add16intra_sse2(). Basically identical to r25171, this
inlines scan8[] and removes loop setup. 15% faster, 0.4% overall.

See "[PATCH] unroll loop in h264_idct_add8_sse2()" thread on ML.

Originally committed as revision 25172 to svn://

4bca6774 09/24/2010 02:05 PM Ronald S. Bultje

Unroll loop in h264_idct_add8_sse2(). This means we can inline scan8[] in the
code directly also and remove loop setup. 20% faster in function, 0.8% overall.

See "[PATCH] unroll loop in h264_idct_add8_sse2()" thread on ML.

Originally committed as revision 25171 to svn://
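
As a rough illustration of why unrolling lets the table be inlined, here is a minimal C sketch. It is not the committed yasm code: the `offsets` table, its values, and both function names are hypothetical stand-ins for scan8[] and the real IDCT loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical offset table standing in for scan8[]; the values are
 * illustrative only and are not the real scan8[] contents. */
static const uint8_t offsets[4] = { 4, 5, 12, 13 };

/* Looped form: every iteration loads an offset from the table and pays
 * for the loop counter and branch. */
int count_coded_loop(const uint8_t *nnz)
{
    int i, n = 0;
    assert(nnz != NULL);
    for (i = 0; i < 4; i++)
        n += nnz[offsets[i]] != 0;
    return n;
}

/* Unrolled form: the table entries become constants in the code, so
 * neither the table load nor the loop setup remains. */
int count_coded_unrolled(const uint8_t *nnz)
{
    assert(nnz != NULL);
    return (nnz[4] != 0) + (nnz[5] != 0) + (nnz[12] != 0) + (nnz[13] != 0);
}
```

Both functions return the same count for any input; only the amount of per-call bookkeeping differs.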

c0bc8b9a 09/21/2010 05:57 PM Måns Rullgård

x86: disable SSE functions using stack when stack is not aligned

This fixes crashes with ICC 10.1.

Originally committed as revision 25153 to svn://

f41237c9 09/18/2010 08:44 PM Måns Rullgård

x86: remove hack disabling sse2 h264 loop filter with 32-bit icc

Originally committed as revision 25146 to svn://

ada65af9 09/17/2010 12:24 PM Ronald S. Bultje

Don't access upper 32 bits of a 32-bit int on 64-bit systems.

Originally committed as revision 25140 to svn://

6c3d0218 09/17/2010 03:01 AM Ronald S. Bultje

Properly add HAVE_YASM around yasmified symbols. Should fix compile error
on configurations using --disable-yasm.

Originally committed as revision 25138 to svn://

e2e34104 09/17/2010 01:56 AM Ronald S. Bultje

Move hadamard_diff{,16}_{mmx,mmx2,sse2,ssse3}() from inline asm to yasm,
which will hopefully solve the Win64/FATE failures caused by these functions.

Originally committed as revision 25137 to svn://

d0acc2d2 09/17/2010 01:44 AM Ronald S. Bultje

Move sse16_sse2() from inline asm to yasm. It is one of the functions causing
Win64/FATE issues.

Originally committed as revision 25136 to svn://

1d16a1cf 09/14/2010 01:36 PM Ronald S. Bultje

Rename h264_idct_sse2.asm to h264_idct.asm; move inline IDCT asm from
h264dsp_mmx.c to h264_idct.asm (as yasm code). Because the loops are now
coded in asm instead of C, this is (depending on the function) up to 50%
faster for cases where gcc didn't do a great job at looping....

8acb554a 09/10/2010 02:25 AM Jason Garrett-Glaser

This leaves no more GPL-only H.264 decoding asm code.

Approved by Loren.

Originally committed as revision 25092 to svn://

c6c98d08 09/08/2010 03:07 PM Stefano Sabatini

Move mm_support() from libavcodec to libavutil, make it a public
function and rename it to av_get_cpu_flags().

Originally committed as revision 25076 to svn://

b1c32fb5 09/05/2010 10:10 AM Reimar Döffinger

Use "d" suffix for general-purpose registers used with movd.
This increases compatibility with nasm and is also more consistent,
e.g. with h264_intrapred.asm and h264_chromamc.asm that already
do it that way.

Originally committed as revision 25042 to svn://

7160bb71 09/04/2010 09:59 AM Stefano Sabatini

Rename FF_MM_ symbols related to CPU features flags as AV_CPU_FLAG_
symbols, and move them from libavcodec/avcodec.h to libavutil/cpu.h.

Originally committed as revision 25040 to svn://

2c166c3a 09/03/2010 04:52 PM Ronald S. Bultje

Port latest x264 deblock asm (before they moved to using NV12 as internal
format), LGPL'ed with permission from Jason and Loren. This includes mmx2
code, so remove inline asm from h264dsp_mmx.c accordingly.

Originally committed as revision 25031 to svn://

a10a9f5c 09/01/2010 11:19 PM Eli Friedman

Fix typo in r25019.

Patch by Eli Friedman <eli.friedman at gmail dot com>.

Originally committed as revision 25022 to svn://

615da9b1 09/01/2010 09:10 PM Ronald S. Bultje

Unscrew breakage after my last commit because of symbol prefixes.

Originally committed as revision 25020 to svn://

a33a2562 09/01/2010 08:56 PM Ronald S. Bultje

Rename h264_weight_sse2.asm to h264_weight.asm; add 16x8/8x16/8x4 non-square
biweight code to sse2/ssse3; add sse2 weight code; and use that same code to
create mmx2 functions also, so that the inline asm in h264dsp_mmx.c can be
removed. OK'ed by Jason on IRC....

14bc1f24 09/01/2010 08:48 PM Ronald S. Bultje

Split h264dsp_mmx.c (which was #included in dsputil_mmx.c) in h264_qpel_mmx.c,
still #included in dsputil_mmx.c and is part of DSPContext, and h264dsp_mmx.c,
which represents H264DSPContext and is now compiled on its own.

Originally committed as revision 25018 to svn://

5929b3a6 08/31/2010 12:32 PM Ronald S. Bultje

Fix vertical align.

Originally committed as revision 25009 to svn://

79ce0f00 08/30/2010 08:30 PM Ronald S. Bultje

Fix compilation failure if yasm is disabled (missing vp3 symbols).

Originally committed as revision 24992 to svn://

de1c253b 08/30/2010 04:34 PM Ronald S. Bultje

Split intra prediction initialization (i.e. assigning of function pointers)
into its own file, it doesn't belong in h264dsp_mmx.c (much less so in

Originally committed as revision 24990 to svn://

d0eb5a11 08/30/2010 04:31 PM Ronald S. Bultje

Move H264 chroma MC from inline asm to yasm. This fixes VP3/5/6 and VC-1
fate failures on Win64.

Originally committed as revision 24989 to svn://

e9f5f020 08/30/2010 04:25 PM Ronald S. Bultje

Move VP3 IDCT functions from inline ASM to YASM. This fixes part of the VP3/5/6
issues on Win64.

Originally committed as revision 24988 to svn://

7e7c4b60 08/30/2010 04:22 PM Ronald S. Bultje

Put ff_ prefix on non-static {put_signed,put,add}_pixels_clamped_mmx()

Originally committed as revision 24987 to svn://

19d929f9 08/28/2010 09:03 PM Loren Merritt

cosmetics in imdct_sse

Originally committed as revision 24958 to svn://

4eca52ed 08/26/2010 02:33 PM Ronald S. Bultje

Fix typos when converting inline asm to yasm, fixes MMX-only fate-ea-vp61.

Originally committed as revision 24948 to svn://

6697bc33 08/25/2010 08:36 PM Ronald S. Bultje

Revert r24931, it broke Win32 and some BSD compiles (yay fate).

Originally committed as revision 24934 to svn://

72f64240 08/25/2010 07:57 PM Ronald S. Bultje

Mark xmm6 and xmm7 as clobbered in ff_vp3_idct_sse2(), which is contributing
to the VP6 fate failures on Win64.

Originally committed as revision 24931 to svn://

69dad87c 08/25/2010 03:41 PM Måns Rullgård

VP6: fix vp6_filter_diag4_mmx/sse on 64-bit

The stride can be negative and must be sign extended before being
used in pointer arithmetic.

Originally committed as revision 24926 to svn://
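
The failure mode can be sketched in C, assuming a 32-bit `int` and 64-bit pointers. The helper names below are hypothetical and only model what the asm sees in a 64-bit register; they are not taken from the VP6 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What a 64-bit register holds if the 32-bit stride is used with its
 * upper half cleared (zero extension): -8 becomes a huge positive offset. */
uint64_t offset_zero_extended(int stride)
{
    return (uint64_t)(uint32_t)stride;
}

/* What the register must hold instead (sign extension): -8 stays -8. */
int64_t offset_sign_extended(int stride)
{
    return (int64_t)stride;
}

/* Pointer arithmetic with the stride widened to a signed pointer-sized
 * type first, so negative strides step backwards through the buffer. */
const uint8_t *advance_rows(const uint8_t *p, int stride, int rows)
{
    assert(p != NULL);
    return p + (ptrdiff_t)stride * rows;
}
```

With a stride of -8, the zero-extended value is 4294967288 and would point far outside the buffer, while the sign-extended value moves the pointer back by one 8-byte row.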

89fa3504 08/25/2010 01:44 PM Ronald S. Bultje

Move vp6_filter_diag4() x86 SIMD code from inline ASM to YASM. This should
help in fixing the Win64 fate failures.

Originally committed as revision 24922 to svn://

3a088514 08/25/2010 01:42 PM Ronald S. Bultje

Move vp6_filter_diag4() from DSPContext to VP56DSPContext.

Originally committed as revision 24921 to svn://

c0ec9918 08/24/2010 05:47 PM Måns Rullgård

Remove global mm_flags variable

Originally committed as revision 24909 to svn://

3611c45a 08/24/2010 04:52 PM Ronald S. Bultje

Mark xmm registers as clobbered in simple loopfilter. Should fix the last
two VP8-related fate failures on Win64.

Originally committed as revision 24908 to svn://

cb4f1246 08/23/2010 03:51 PM Alex Converse

imdct/x86: Use "s->mdct_size" instead of "1 << s->mdct_bits".

It generates smaller cleaner code.

Originally committed as revision 24887 to svn://
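
A minimal sketch of the idea, assuming a context that stores both the log2 size and the size itself. `MdctSketch` and the function names are hypothetical; only the two field names come from the commit message.

```c
#include <assert.h>

/* Hypothetical context holding both the log2 size and the cached size. */
typedef struct MdctSketch {
    int mdct_bits;  /* log2 of the transform size */
    int mdct_size;  /* cached 1 << mdct_bits, set once at init */
} MdctSketch;

void mdct_sketch_init(MdctSketch *s, int nbits)
{
    assert(s != 0 && nbits > 0 && nbits < 31);
    s->mdct_bits = nbits;
    s->mdct_size = 1 << nbits;
}

/* Recomputes the size from the bit count on every call. */
int half_size_recomputed(const MdctSketch *s)
{
    return (1 << s->mdct_bits) >> 1;
}

/* Reads the cached size directly; no variable shift is needed. */
int half_size_cached(const MdctSketch *s)
{
    return s->mdct_size >> 1;
}
```

Both variants return the same value; the cached form simply leaves the compiler one less operation to emit at each use.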

684d608b 08/23/2010 02:41 AM Ronald S. Bultje

Fix segfaults in VP8 SIMD code on Win64 (and FATE/win64 failures).

Originally committed as revision 24871 to svn://

78b5c97d 08/22/2010 02:39 PM Alex Converse

Convert ff_imdct_half_sse() to yasm.

This is to avoid split asm sections that attempt to preserve some
registers between sections.

Originally committed as revision 24869 to svn://

05c04cdf 08/12/2010 01:11 AM Jason Garrett-Glaser

VP5/6/8: ~7% faster arithmetic decoding
Grab from the bitstream in 16-bit chunks instead of 8-bit chunks.
TODO: grab in 32-bit chunks on 64-bit systems.

Originally committed as revision 24783 to svn://
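
The chunked refill can be illustrated with a generic bit cache in C. This is a sketch under stated assumptions, not the VP5/6/8 arithmetic decoder: `BitCache` and its functions are hypothetical, and it reads plain bits rather than decoding probabilities. The point is only that one refill brings in 16 bits, halving the number of refill steps compared with byte-wise refills.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit cache: the low `bits` bits of `cache` are unread. */
typedef struct BitCache {
    const uint8_t *buf;  /* next unread byte */
    const uint8_t *end;  /* one past the last byte */
    uint32_t cache;
    int bits;
} BitCache;

void bitcache_init(BitCache *c, const uint8_t *data, size_t size)
{
    assert(c != NULL && data != NULL);
    c->buf   = data;
    c->end   = data + size;
    c->cache = 0;
    c->bits  = 0;
}

/* Pull in two bytes (big-endian) in one step; past the end, feed zeros. */
static void bitcache_refill16(BitCache *c)
{
    uint32_t hi = c->buf < c->end ? *c->buf++ : 0;
    uint32_t lo = c->buf < c->end ? *c->buf++ : 0;
    c->cache = (c->cache << 16) | (hi << 8) | lo;
    c->bits += 16;
}

/* Read n bits (1..16), refilling at most once per call. */
unsigned bitcache_get(BitCache *c, int n)
{
    assert(n >= 1 && n <= 16);
    if (c->bits < n)
        bitcache_refill16(c);
    c->bits -= n;
    return (c->cache >> c->bits) & ((1u << n) - 1);
}
```

Because at most 15 bits are left over before a refill, the cache never needs more than 31 bits, so a 32-bit word is enough for 16-bit chunks. The TODO in the commit message (32-bit chunks on 64-bit systems) would need a wider cache word.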

4a384de5 08/07/2010 11:10 PM Jason Garrett-Glaser

Split h264dsp and h264pred in configure.
Many H.264 derivatives, like RV40 and VP8, use the H.264 prediction functions
but not the weight/loopfilter functions.
This should reduce the size of builds with one of these derivatives but without
H.264 decoding itself....

98fe09df 08/05/2010 12:49 AM Jason Garrett-Glaser

Add file missing in r24702

Originally committed as revision 24703 to svn://

c12d6955 08/05/2010 12:13 AM Eli Friedman

H.264: SSE2/SSSE3 weighted prediction asm
Patch by Eli Friedman <eli.friedman at gmail dot com>

Originally committed as revision 24702 to svn://

f079a64a 08/03/2010 08:59 PM Måns Rullgård

Move cavs dsp functions to their own struct

Originally committed as revision 24685 to svn://

8b9b5e08 08/03/2010 11:21 AM Jason Garrett-Glaser

VP5/6/8: add one inline missed in r24677

Originally committed as revision 24682 to svn://

827d43bb 08/02/2010 08:18 PM Jason Garrett-Glaser

VP8: move zeroing of luma DC block into the WHT
Lets us do the zeroing in asm instead of C.
Also makes it consistent with the way the regular iDCT code does it.

Originally committed as revision 24668 to svn://

6341838f 07/31/2010 11:13 PM Ronald S. Bultje

Use word-writing instead of dword-writing (with two cached but otherwise
unchanged bytes) in the horizontal simple loopfilter. This makes the filter
quite a bit faster in itself (~30 cycles less on Core1), probably mostly
because we don't need a complex 4x4 transpose, but only a simple byte...

fa738b3a 07/31/2010 04:20 PM Vitor Sessak

Remove x86/mmx.h. It is not used anymore and has been deprecated for years.

Originally committed as revision 24618 to svn://

de4bc44a 07/31/2010 02:50 PM Vitor Sessak

Convert deinterlacing MMX code to YASM

Originally committed as revision 24615 to svn://

740dfe70 07/29/2010 10:45 PM Vitor Sessak

Fix compilation in x86_64. I broke it with r24580.

Originally committed as revision 24582 to svn://

2c3dda68 07/29/2010 10:19 PM Vitor Sessak

Translate libmpeg2 MMX IDCT to plain asm

Originally committed as revision 24580 to svn://

ab4d0318 07/26/2010 09:18 PM Ronald S. Bultje

Use pmaddubsw for the mbedge_filter (>=ssse3), 6-10 cycles faster.

Originally committed as revision 24514 to svn://

e25dee60 07/26/2010 07:34 PM Jason Garrett-Glaser

VP8: Much faster SSE2 MC
5-10% faster or more on Phenom, Athlon 64, and some others.
Helps some on pre-SSSE3 Intel chips as well, but not as much.

Originally committed as revision 24513 to svn://

48adb7e7 07/26/2010 02:07 PM Ronald S. Bultje

Enable no-loop memory/register saving for ssse3/sse4 also.

Originally committed as revision 24511 to svn://

2a180c69 07/26/2010 02:00 PM Ronald S. Bultje

Save a register (or regsize of stackspace for x86-32) for the no-loop
mbedge loopfilter functions, by re-using space that holds a variable
that we no longer need.

Originally committed as revision 24510 to svn://

bcd4aa64 07/26/2010 01:56 PM Ronald S. Bultje

Use nested ifs instead of &&, which appears to not work with %ifidn (i.e. this
construct was always enabled, even for <ssse3 versions).

Originally committed as revision 24509 to svn://

2208053b 07/26/2010 01:50 PM Ronald S. Bultje

Split pextrw macro-spaghetti into several opt-specific macros, this will make
future new optimizations (imagine a sse5) much easier. Also fix a bug where
we used the direction (%2) rather than optimization (%1) to enable this, which
means it wasn't ever actually used......

6de5b7c6 07/25/2010 02:42 AM Ronald S. Bultje

Fix obvious bug in assignment. Somehow, the test vectors don't test this...

Originally committed as revision 24489 to svn://

e3f7bf77 07/24/2010 07:33 PM Ronald S. Bultje

Fix SPLATB_REG mess. Used to be an if/elseif/elseif/elseif spaghetti, so this
splits it into small optimization-specific macros which are selected for each
DSP function. The advantage of this approach is that the sse4 functions now
use the ssse3 codepath also without needing an explicit sse4 codepath....

3611e7a3 07/23/2010 09:46 PM Eli Friedman

Inline asm for VP56 arith coder

This is a lot more reliable to get cmov rather than trying to trick gcc into
generating it, useful since it's 2% faster overall.

Patch by Eli Friedman <eli.friedman at gmail>

Originally committed as revision 24471 to svn://

3ae079a3 07/23/2010 06:02 AM Jason Garrett-Glaser

VP8: optimize DC-only chroma case in the same way as luma.
Add MMX idct_dc_add4uv function for this case.
~40% faster chroma idct.

Originally committed as revision 24455 to svn://

51c91564 07/23/2010 03:02 AM Jason Garrett-Glaser

VP8 asm: cosmetics (spacing)

Originally committed as revision 24453 to svn://

8a467b2d 07/23/2010 02:58 AM Jason Garrett-Glaser

VP8: 30% faster idct_mb
Take shortcuts based on statistically common situations.
Add 4-at-a-time idct_dc function (mmx and sse2) since rows of 4 DC-only DCT
blocks are common.
TODO: tie this more directly into the MB mode, since the DC-level transform is
only used for non-splitmv blocks?...

c25c7767 07/23/2010 12:07 AM Jason Garrett-Glaser

VP8: clear DCT blocks in iDCT instead of using clear_blocks.
~0.3% faster overall.

Originally committed as revision 24448 to svn://

dc5eec80 07/22/2010 07:59 PM Ronald S. Bultje

Use pextrw for SSE4 mbedge filter result writing, speedup 5-10 cycles on
CPUs supporting it.

Originally committed as revision 24437 to svn://

003243c3 07/22/2010 01:35 AM Ronald S. Bultje

Fix and enable horizontal >=SSE2 mbedge loopfilter.

Originally committed as revision 24409 to svn://

c7b1d976 07/22/2010 12:39 AM Loren Merritt

relicense h264 deblock sse2 to lgpl

Originally committed as revision 24408 to svn://

532e7697 07/21/2010 10:45 PM Loren Merritt

sync yasm macros from x264

Originally committed as revision 24406 to svn://

8731dbd8 07/21/2010 10:41 PM Jason Garrett-Glaser

Eliminate one instruction in VP8 dc_add_sse4

Originally committed as revision 24405 to svn://

7dd224a4 07/21/2010 10:11 PM Jason Garrett-Glaser

Various VP8 x86 deblocking speedups
SSSE3 versions, improve SSE2 versions a bit.
SSE2/SSSE3 mbedge h functions are currently broken, so explicitly disable them.

Originally committed as revision 24403 to svn://

b8b231b5 07/21/2010 08:51 PM Jason Garrett-Glaser

Make mmx VP8 WHT faster
Avoid pextrw, since it's slow on many older CPUs.
Now it doesn't require mmxext either.

Originally committed as revision 24397 to svn://

af521abc 07/21/2010 10:02 AM David Conrad

Add header declarations for mmx/sse constants missing them

Originally committed as revision 24381 to svn://

c7eec581 07/21/2010 10:02 AM David Conrad

Move ff_pw_* from vc1dsp_mmx.c to dsputil_mmx.c

Should fix compilation with icc and should help prevent any future duplicates

Originally committed as revision 24380 to svn://

e9e456d8 07/20/2010 10:58 PM Ronald S. Bultje

VP8 MBedge loopfilter MMX/MMX2/SSE2 functions for both luma (width=16)
and chroma (width=8).

Originally committed as revision 24378 to svn://

268821e7 07/20/2010 10:04 PM Ronald S. Bultje

Chroma (width=8) inner loopfilter MMX/MMX2/SSE2 for VP8 decoder.

Originally committed as revision 24377 to svn://

c60ed66d 07/19/2010 11:57 PM Ronald S. Bultje

Revert r24339 (it causes fate failures on x86-64) - I'll figure out what's
wrong with it tomorrow or so, then re-submit.

Originally committed as revision 24341 to svn://

6526976f 07/19/2010 10:38 PM Ronald S. Bultje

Remove FF_MM_SSE2/3 flags for CPUs where this is generally not faster than
regular MMX code. Examples of this are the Core1 CPU. Instead, set a new flag,
FF_MM_SSE2/3SLOW, which can be checked for particular SSE2/3 functions that
have been checked specifically on such CPUs and are actually faster than...

1878f685 07/19/2010 09:53 PM Ronald S. Bultje

Implement chroma (width=8) inner loopfilter MMX/MMX2/SSE2 functions.

Originally committed as revision 24339 to svn://

fb9bdf04 07/19/2010 09:45 PM Ronald S. Bultje

Be more efficient with registers or stack memory. Saves 8/16 bytes stack
for x86-32, or 2 MM registers on x86-64.

Originally committed as revision 24338 to svn://

3facfc99 07/19/2010 09:18 PM Ronald S. Bultje

Change function prototypes for width=8 inner and mbedge loopfilter functions
so that it does both U and V planes at the same time. This will have speed
advantages when using SSE2 (or higher) optimizations, since we can do both
the U and V rows together in a single xmm register....

1ee076b1 07/18/2010 08:06 PM Loren Merritt

more credits to D. J. Bernstein for fft

Originally committed as revision 24308 to svn://

819b2dd2 07/16/2010 09:35 PM Ronald S. Bultje

Attempt to fix x86-64 testsuite on fate.

Originally committed as revision 24275 to svn://

6f323f12 07/16/2010 07:54 PM Ronald S. Bultje

Remove duplicate define.

Originally committed as revision 24272 to svn://

889b2c26 07/16/2010 07:54 PM Ronald S. Bultje

Revert 24270, it contained some stuff that shouldn't have been in there.

Originally committed as revision 24271 to svn://

2356a783 07/16/2010 07:42 PM Ronald S. Bultje

Remove duplicate define.

Originally committed as revision 24270 to svn://

ede1b966 07/16/2010 07:38 PM Ronald S. Bultje

Give x86 r%d registers names, this will simplify implementation of the chroma
inner loopfilter, and it also allows us to save one register on x86-64/sse2.

Originally committed as revision 24269 to svn://

526e831a 07/16/2010 06:29 PM Ronald S. Bultje

Change return statement, the REP_RET is a mistake since the else case (x86-64,
sse2) doesn't actually loop, so REP_RET isn't necessary.

Originally committed as revision 24268 to svn://

a711eb48 07/15/2010 11:02 PM Ronald S. Bultje

VP8 H/V inner loopfilter MMX/MMXEXT/SSE2 optimizations.

Originally committed as revision 24250 to svn://

faa26db2 07/11/2010 10:53 PM David Conrad

MMX/SSE VC1 loop filter

Originally committed as revision 24208 to svn://

7af8fbd3 07/11/2010 10:52 PM David Conrad

Make ff_pw_4 128 bits

Originally committed as revision 24207 to svn://

881fd7a6 07/06/2010 05:48 PM Vitor Sessak

Move SSE optimized 32-point DCT to its own file. Should fix breakage with YASM

Originally committed as revision 24078 to svn://

4dcc4f8e 07/06/2010 04:58 PM Vitor Sessak

SSE optimized 32-point DCT

Originally committed as revision 24077 to svn://

f2a30bd8 07/03/2010 07:26 PM Ronald S. Bultje

Simple H/V loopfilter for VP8 in MMX, MMX2 and SSE2 (yay for yasm macros).

Originally committed as revision 24029 to svn://

b06855f1 07/03/2010 12:48 AM Jason Garrett-Glaser

SSSE3 versions of vp8 width4 bilinear MC functions

Originally committed as revision 24013 to svn://

dcc602d8 07/02/2010 05:27 AM Jason Garrett-Glaser

SSSE3 versions of width4 VP8 6-tap MC functions
Also make some small changes to saturation order of 4-tap SSSE3 MC to fix a
non-bitexactness bug.

Patch mostly by Eli Friedman <eli.friedman AT gmail DOT com>.

Originally committed as revision 23965 to svn://

8434fc26 07/01/2010 10:09 PM Jason Garrett-Glaser

Fix 100L in vp8dsp asm init

Originally committed as revision 23946 to svn://

17dc7c7a 07/01/2010 10:29 AM Jason Garrett-Glaser

Fix h264/vp8 intra pred on Athlon XP
Whose idea was it to have a CPU that didn't SIGILL on an invalid instruction?

Originally committed as revision 23927 to svn://

49bd8e4b 06/30/2010 03:38 PM Måns Rullgård

Fix grammar errors in documentation

Originally committed as revision 23904 to svn://

82a8d0f1 06/29/2010 05:23 PM Jason Garrett-Glaser

Use add instead of lshift in mmxext vp8 idct

Originally committed as revision 23891 to svn://

565344e7 06/29/2010 05:04 PM Ronald S. Bultje

Remove unused macros (duplicates from the now-LGPL x86util.asm).

Originally committed as revision 23890 to svn://

2dd2f716 06/29/2010 02:43 PM Ronald S. Bultje

MMX idct_add for VP8.

Originally committed as revision 23886 to svn://

29e71937 06/29/2010 12:28 PM Jason Garrett-Glaser

Add missing mm_support call to ff_h264_pred_init_x86.
I'm not sure if this is supposed to be here, but it can't hurt.

Originally committed as revision 23885 to svn://

004cda8e 06/29/2010 01:41 AM Jason Garrett-Glaser

Add mmxext version of VP8 DC Hadamard transform

Originally committed as revision 23878 to svn://