Revision 8a322796 libswscale/yuv2rgb_altivec.c
| Old | New | libswscale/yuv2rgb_altivec.c |
|---|---|---|
| 21 | 21 | */ |
| 22 | 22 | |
| 23 | 23 | /* |
| 24 | | convert I420 YV12 to RGB in various formats, |
| 25 | | it rejects images that are not in 420 formats |
| 26 | | it rejects images that don't have widths of multiples of 16 |
| 27 | | it rejects images that don't have heights of multiples of 2 |
| 28 | | reject defers to C simulation codes. |
| | 24 | Convert I420 YV12 to RGB in various formats, |
| | 25 | it rejects images that are not in 420 formats, |
| | 26 | it rejects images that don't have widths of multiples of 16, |
| | 27 | it rejects images that don't have heights of multiples of 2. |
| | 28 | Reject defers to C simulation code. |
| 29 | 29 | |
| 30 | | lots of optimizations to be done here |
| | 30 | Lots of optimizations to be done here. |
| 31 | 31 | |
| 32 | | 1. need to fix saturation code, I just couldn't get it to fly with packs and adds. |
| 33 | | so we currently use max min to clip |
| | 32 | 1. Need to fix saturation code. I just couldn't get it to fly with packs |
| | 33 | and adds, so we currently use max/min to clip. |
| 34 | 34 | |
| 35 | | 2. the inefficient use of chroma loading needs a bit of brushing up |
| | 35 | 2. The inefficient use of chroma loading needs a bit of brushing up. |
| 36 | 36 | |
| 37 | | 3. analysis of pipeline stalls needs to be done, use shark to identify pipeline stalls |
| | 37 | 3. Analysis of pipeline stalls needs to be done. Use shark to identify |
| | 38 | pipeline stalls. |
| 38 | 39 | |
| 39 | 40 | |
| 40 | 41 | MODIFIED to calculate coeffs from currently selected color space. |
| 41 | | MODIFIED core to be a macro which you spec the output format. |
| 42 | | ADDED UYVY conversion which is never called due to some thing in SWSCALE. |
| | 42 | MODIFIED core to be a macro where you specify the output format. |
| | 43 | ADDED UYVY conversion which is never called due to some thing in swscale. |
| 43 | 44 | CORRECTED algorithim selection to be strict on input formats. |
| 44 | | ADDED runtime detection of altivec. |
| | 45 | ADDED runtime detection of AltiVec. |
| 45 | 46 | |
| 46 | 47 | ADDED altivec_yuv2packedX vertical scl + RGB converter |
| 47 | 48 | |
| 48 | 49 | March 27,2004 |
| 49 | 50 | PERFORMANCE ANALYSIS |
| 50 | 51 | |
| 51 | | The C version use 25% of the processor or ~250Mips for D1 video rawvideo used as test |
| 52 | | The ALTIVEC version uses 10% of the processor or ~100Mips for D1 video same sequence |
| | 52 | The C version uses 25% of the processor or ~250Mips for D1 video rawvideo |
| | 53 | used as test. |
| | 54 | The AltiVec version uses 10% of the processor or ~100Mips for D1 video |
| | 55 | same sequence. |
| 53 | 56 | |
| 54 | | 720*480*30 ~10MPS |
| | 57 | 720 * 480 * 30 ~10MPS |
| 55 | 58 | |
| 56 | | so we have roughly 10clocks per pixel this is too high something has to be wrong. |
| | 59 | so we have roughly 10 clocks per pixel. This is too high, something has |
| | 60 | to be wrong. |
| 57 | 61 | |
| 58 | | OPTIMIZED clip codes to utilize vec_max and vec_packs removing the need for vec_min. |
| | 62 | OPTIMIZED clip codes to utilize vec_max and vec_packs removing the |
| | 63 | need for vec_min. |
| 59 | 64 | |
| 60 | | OPTIMIZED DST OUTPUT cache/dma controls. we are pretty much |
| 61 | | guaranteed to have the input video frame it was just decompressed so |
| 62 | | it probably resides in L1 caches. However we are creating the |
| 63 | | output video stream this needs to use the DSTST instruction to |
| 64 | | optimize for the cache. We couple this with the fact that we are |
| 65 | | not going to be visiting the input buffer again so we mark it Least |
| 66 | | Recently Used. This shaves 25% of the processor cycles off. |
| | 65 | OPTIMIZED DST OUTPUT cache/DMA controls. We are pretty much guaranteed to have |
| | 66 | the input video frame, it was just decompressed so it probably resides in L1 |
| | 67 | caches. However, we are creating the output video stream. This needs to use the |
| | 68 | DSTST instruction to optimize for the cache. We couple this with the fact that |
| | 69 | we are not going to be visiting the input buffer again so we mark it Least |
| | 70 | Recently Used. This shaves 25% of the processor cycles off. |
| 67 | 71 | |
| 68 | | Now MEMCPY is the largest mips consumer in the system, probably due |
| | 72 | Now memcpy is the largest mips consumer in the system, probably due |
| 69 | 73 | to the inefficient X11 stuff. |
| 70 | 74 | |
| 71 | 75 | GL libraries seem to be very slow on this machine 1.33Ghz PB running |
| 72 | 76 | Jaguar, this is not the case for my 1Ghz PB. I thought it might be |
| 73 | | a versioning issues, however I have libGL.1.2.dylib for both |
| 74 | | machines. ((We need to figure this out now)) |
| | 77 | a versioning issue, however I have libGL.1.2.dylib for both |
| | 78 | machines. (We need to figure this out now.) |
| 75 | 79 | |
| 76 | | GL2 libraries work now with patch for RGB32 |
| | 80 | GL2 libraries work now with patch for RGB32. |
| 77 | 81 | |
| 78 | | NOTE quartz vo driver ARGB32_to_RGB24 consumes 30% of the processor |
| | 82 | NOTE: quartz vo driver ARGB32_to_RGB24 consumes 30% of the processor. |
| 79 | 83 | |
| 80 | | Integrated luma prescaling adjustment for saturation/contrast/brightness adjustment. |
| | 84 | Integrated luma prescaling adjustment for saturation/contrast/brightness |
| | 85 | adjustment. |
| 81 | 86 | */ |
| 82 | 87 | |
| 83 | 88 | #include <stdio.h> |
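
The throughput estimate in the PERFORMANCE ANALYSIS comment (720 * 480 * 30 ~10MPS, then "roughly 10 clocks per pixel") can be checked with a few lines of C. This is only a sketch of the arithmetic: the function names are hypothetical and not part of the file, and it assumes, as the comment implicitly does, that one instruction costs about one clock, so that ~100Mips can be read as ~100 million clocks per second.

```c
#include <assert.h>

/* Pixel rate of the D1 test sequence quoted in the comment:
 * 720 x 480 pixels at 30 frames per second. */
static long d1_pixels_per_second(void)
{
    return 720L * 480L * 30L;
}

/* Hypothetical helper: instructions spent per pixel.
 * Assumes roughly one instruction per clock, so the result can be
 * read as "clocks per pixel" in the sense used by the comment. */
static double cost_per_pixel(double instructions_per_second,
                             long pixels_per_second)
{
    assert(pixels_per_second > 0);
    return instructions_per_second / (double)pixels_per_second;
}
```

Calling `cost_per_pixel(100e6, d1_pixels_per_second())` gives a value a little under 10 for the AltiVec figure, which matches the comment's estimate. The same call with `250e6` gives about 24 for the C version.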
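
The comment mentions two clipping strategies: first "max/min to clip", later "vec_max and vec_packs removing the need for vec_min". The idea can be illustrated with a scalar model. The sketch below is an assumption-laden illustration, not the code in this file: it uses no AltiVec intrinsics, all three function names are hypothetical, and it assumes 16-bit intermediate values narrowed to unsigned 8-bit output. The point is that a saturating narrow already caps the upper bound, so only the lower bound needs an explicit max.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of "max/min to clip":
 * clamp the lower bound with max, then the upper bound with min. */
static uint8_t clip_max_min(int16_t v)
{
    int16_t lo = v < 0 ? 0 : v;        /* max(v, 0)    */
    int16_t hi = lo > 255 ? 255 : lo;  /* min(lo, 255) */
    return (uint8_t)hi;
}

/* Hypothetical scalar model of a saturating narrow to unsigned 8 bits.
 * Only valid for non-negative input; values above 255 saturate. */
static uint8_t pack_saturate_u8(int16_t nonneg)
{
    assert(nonneg >= 0);
    return nonneg > 255 ? 255 : (uint8_t)nonneg;
}

/* Hypothetical scalar model of "max + saturating pack":
 * the narrowing step is needed anyway, and it supplies the upper clamp. */
static uint8_t clip_max_pack(int16_t v)
{
    int16_t lo = v < 0 ? 0 : v;        /* max(v, 0) */
    return pack_saturate_u8(lo);
}
```

Both variants return the same result for every 16-bit input. In a vector implementation the second form saves one operation per vector, because the separate min disappears while the pack has to happen regardless.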