
ffmpeg / libavcodec / x86 / vp8dsp.asm @ 0cc8a5d0


# Date Author Comment
b1c32fb5 09/05/2010 10:10 AM Reimar Döffinger

Use "d" suffix for general-purpose registers used with movd.
This increases compatibility with nasm and is also more consistent,
e.g. with h264_intrapred.asm and h264_chromamc.asm that already
do it that way.

Originally committed as revision 25042 to svn://

3611c45a 08/24/2010 04:52 PM Ronald S. Bultje

Mark xmm registers as clobbered in simple loopfilter. Should fix the last
two VP8-related fate failures on Win64.

Originally committed as revision 24908 to svn://

684d608b 08/23/2010 02:41 AM Ronald S. Bultje

Fix segfaults in VP8 SIMD code on Win64 (and FATE/win64 failures).

Originally committed as revision 24871 to svn://

827d43bb 08/02/2010 08:18 PM Jason Garrett-Glaser

VP8: move zeroing of luma DC block into the WHT
Lets us do the zeroing in asm instead of C.
Also makes it consistent with the way the regular iDCT code does it.

Originally committed as revision 24668 to svn://

6341838f 07/31/2010 11:13 PM Ronald S. Bultje

Use word-writing instead of dword-writing (with two cached but otherwise
unchanged bytes) in the horizontal simple loopfilter. This makes the filter
quite a bit faster in itself (~30 cycles less on Core1), probably mostly
because we don't need a complex 4x4 transpose, but only a simple byte...

ab4d0318 07/26/2010 09:18 PM Ronald S. Bultje

Use pmaddubsw for the mbedge_filter (>=ssse3), 6-10 cycles faster.

Originally committed as revision 24514 to svn://

e25dee60 07/26/2010 07:34 PM Jason Garrett-Glaser

VP8: Much faster SSE2 MC
5-10% faster or more on Phenom, Athlon 64, and some others.
Helps some on pre-SSSE3 Intel chips as well, but not as much.

Originally committed as revision 24513 to svn://

48adb7e7 07/26/2010 02:07 PM Ronald S. Bultje

Enable no-loop memory/register saving for ssse3/sse4 also.

Originally committed as revision 24511 to svn://

2a180c69 07/26/2010 02:00 PM Ronald S. Bultje

Save a register (or regsize of stackspace for x86-32) for the no-loop
mbedge loopfilter functions, by re-using space that holds a variable
that we no longer need.

Originally committed as revision 24510 to svn://

bcd4aa64 07/26/2010 01:56 PM Ronald S. Bultje

Use nested ifs instead of &&, which appears to not work with %ifidn (i.e. this
construct was always enabled, even for <ssse3 versions).

Originally committed as revision 24509 to svn://

2208053b 07/26/2010 01:50 PM Ronald S. Bultje

Split pextrw macro-spaghetti into several opt-specific macros, this will make
future new optimizations (imagine a sse5) much easier. Also fix a bug where
we used the direction (%2) rather than optimization (%1) to enable this, which
means it wasn't ever actually used......

6de5b7c6 07/25/2010 02:42 AM Ronald S. Bultje

Fix obvious bug in assignment. Somehow, the test vectors don't test this...

Originally committed as revision 24489 to svn://

e3f7bf77 07/24/2010 07:33 PM Ronald S. Bultje

Fix SPLATB_REG mess. Used to be an if/elseif/elseif/elseif spaghetti, so this
splits it into small optimization-specific macros which are selected for each
DSP function. The advantage of this approach is that the sse4 functions now
use the ssse3 codepath also without needing an explicit sse4 codepath....

3ae079a3 07/23/2010 06:02 AM Jason Garrett-Glaser

VP8: optimize DC-only chroma case in the same way as luma.
Add MMX idct_dc_add4uv function for this case.
~40% faster chroma idct.

Originally committed as revision 24455 to svn://

51c91564 07/23/2010 03:02 AM Jason Garrett-Glaser

VP8 asm: cosmetics (spacing)

Originally committed as revision 24453 to svn://

8a467b2d 07/23/2010 02:58 AM Jason Garrett-Glaser

VP8: 30% faster idct_mb
Take shortcuts based on statistically common situations.
Add 4-at-a-time idct_dc function (mmx and sse2) since rows of 4 DC-only DCT
blocks are common.
TODO: tie this more directly into the MB mode, since the DC-level transform is
only used for non-splitmv blocks?...

c25c7767 07/23/2010 12:07 AM Jason Garrett-Glaser

VP8: clear DCT blocks in iDCT instead of using clear_blocks.
~0.3% faster overall.

Originally committed as revision 24448 to svn://

dc5eec80 07/22/2010 07:59 PM Ronald S. Bultje

Use pextrw for SSE4 mbedge filter result writing, speedup 5-10cycles on
CPUs supporting it.

Originally committed as revision 24437 to svn://

003243c3 07/22/2010 01:35 AM Ronald S. Bultje

Fix and enable horizontal >=SSE2 mbedge loopfilter.

Originally committed as revision 24409 to svn://

8731dbd8 07/21/2010 10:41 PM Jason Garrett-Glaser

Eliminate one instruction in VP8 dc_add_sse4

Originally committed as revision 24405 to svn://

7dd224a4 07/21/2010 10:11 PM Jason Garrett-Glaser

Various VP8 x86 deblocking speedups
SSSE3 versions, improve SSE2 versions a bit.
SSE2/SSSE3 mbedge h functions are currently broken, so explicitly disable them.

Originally committed as revision 24403 to svn://

b8b231b5 07/21/2010 08:51 PM Jason Garrett-Glaser

Make mmx VP8 WHT faster
Avoid pextrw, since it's slow on many older CPUs.
Now it doesn't require mmxext either.

Originally committed as revision 24397 to svn://

e9e456d8 07/20/2010 10:58 PM Ronald S. Bultje

VP8 MBedge loopfilter MMX/MMX2/SSE2 functions for both luma (width=16)
and chroma (width=8).

Originally committed as revision 24378 to svn://

268821e7 07/20/2010 10:04 PM Ronald S. Bultje

Chroma (width=8) inner loopfilter MMX/MMX2/SSE2 for VP8 decoder.

Originally committed as revision 24377 to svn://

c60ed66d 07/19/2010 11:57 PM Ronald S. Bultje

Revert r24339 (it causes fate failures on x86-64) - I'll figure out what's
wrong with it tomorrow or so, then re-submit.

Originally committed as revision 24341 to svn://

1878f685 07/19/2010 09:53 PM Ronald S. Bultje

Implement chroma (width=8) inner loopfilter MMX/MMX2/SSE2 functions.

Originally committed as revision 24339 to svn://

fb9bdf04 07/19/2010 09:45 PM Ronald S. Bultje

Be more efficient with registers or stack memory. Saves 8/16 bytes stack
for x86-32, or 2 MM registers on x86-64.

Originally committed as revision 24338 to svn://

3facfc99 07/19/2010 09:18 PM Ronald S. Bultje

Change function prototypes for width=8 inner and mbedge loopfilter functions
so that they handle both U and V planes at the same time. This will have speed
advantages when using SSE2 (or higher) optimizations, since we can do both
the U and V rows together in a single xmm register....

819b2dd2 07/16/2010 09:35 PM Ronald S. Bultje

Attempt to fix x86-64 testsuite on fate.

Originally committed as revision 24275 to svn://

6f323f12 07/16/2010 07:54 PM Ronald S. Bultje

Remove duplicate define.

Originally committed as revision 24272 to svn://

889b2c26 07/16/2010 07:54 PM Ronald S. Bultje

Revert 24270, it contained some stuff that shouldn't have been in there.

Originally committed as revision 24271 to svn://

2356a783 07/16/2010 07:42 PM Ronald S. Bultje

Remove duplicate define.

Originally committed as revision 24270 to svn://

ede1b966 07/16/2010 07:38 PM Ronald S. Bultje

Give x86 r%d registers names, this will simplify implementation of the chroma
inner loopfilter, and it also allows us to save one register on x86-64/sse2.

Originally committed as revision 24269 to svn://

526e831a 07/16/2010 06:29 PM Ronald S. Bultje

Change return statement; the REP_RET is a mistake since the else case (x86-64,
sse2) doesn't actually loop, so REP_RET isn't necessary.

Originally committed as revision 24268 to svn://

a711eb48 07/15/2010 11:02 PM Ronald S. Bultje

VP8 H/V inner loopfilter MMX/MMXEXT/SSE2 optimizations.

Originally committed as revision 24250 to svn://

f2a30bd8 07/03/2010 07:26 PM Ronald S. Bultje

Simple H/V loopfilter for VP8 in MMX, MMX2 and SSE2 (yay for yasm macros).

Originally committed as revision 24029 to svn://

b06855f1 07/03/2010 12:48 AM Jason Garrett-Glaser

SSSE3 versions of vp8 width4 bilinear MC functions

Originally committed as revision 24013 to svn://

dcc602d8 07/02/2010 05:27 AM Jason Garrett-Glaser

SSSE3 versions of width4 VP8 6-tap MC functions
Also make some small changes to saturation order of 4-tap SSSE3 MC to fix a
non-bitexactness bug.

Patch mostly by Eli Friedman <eli.friedman AT gmail DOT com>.

Originally committed as revision 23965 to svn://

82a8d0f1 06/29/2010 05:23 PM Jason Garrett-Glaser

Use add instead of lshift in mmxext vp8 idct

Originally committed as revision 23891 to svn://

565344e7 06/29/2010 05:04 PM Ronald S. Bultje

Remove unused macros (duplicates from the now-LGPL x86util.asm).

Originally committed as revision 23890 to svn://

2dd2f716 06/29/2010 02:43 PM Ronald S. Bultje

MMX idct_add for VP8.

Originally committed as revision 23886 to svn://

004cda8e 06/29/2010 01:41 AM Jason Garrett-Glaser

Add mmxext version of VP8 DC Hadamard transform

Originally committed as revision 23878 to svn://

a912da76 06/28/2010 10:13 PM Jason Garrett-Glaser

Fix VP8 bilinear mc on x86_64

Originally committed as revision 23872 to svn://

0fecad09 06/28/2010 07:14 PM Jason Garrett-Glaser

Add x86 asm functions for VP8 put_pixels

Originally committed as revision 23858 to svn://

a173aa89 06/28/2010 06:56 PM Jason Garrett-Glaser

Add MMX, SSE2, SSSE3 asm for VP8 bilinear MC

Originally committed as revision 23857 to svn://

0178d14f 06/27/2010 02:01 AM Jason Garrett-Glaser

First shot at VP8 optimizations:
- MMXEXT, SSE2 and SSSE3 MC functions
- MMX and SSE4 IDCT dc_add functions

Patch by Jason Garrett-Glaser <darkshikari gmail com> and myself.

Originally committed as revision 23815 to svn://