Fast alpha blending

I'm optimizing an alpha blending routine, and I believe I've reached the limit of what I can do in C++, so I'm going to write a couple of special cases in x86 assembly. I have a couple of questions:
1. Do I need to write specialized function prologues and epilogues for each compiler? I'm guessing not, since that's the whole point of calling conventions. What about across operating systems?
2. The routine blends two RGBA surfaces together. The formulas are basically these (I've eliminated all the integer arithmetic to make them easier to read):

bottom_alpha = 1 - (1 - top_alpha * alpha) * (1 - bottom_alpha)
(Here, 'alpha' is a multiplier that's applied to the top surface.)
composite_alpha = top_alpha / bottom_alpha
bottom_color = composite_alpha * top_color + (1 - composite_alpha) * bottom_color


The question is, what would be the best SIMD instruction set to implement this? Some reference links wouldn't hurt, either.
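In scalar form, a direct transcription of the formulas above would look like this (floats in [0,1]; the struct and names are just for illustration):

```cpp
#include <cassert>
#include <cmath>

// One channel only; 'a' is the pixel's alpha, 'color' the channel value.
struct Pixel { float color; float a; };

// 'alpha' is the extra multiplier applied to the top surface.
Pixel blend(Pixel top, Pixel bottom, float alpha)
{
    Pixel out;
    out.a = 1.0f - (1.0f - top.a * alpha) * (1.0f - bottom.a);
    // Guard against a fully transparent result to avoid dividing by zero.
    float composite = (out.a > 0.0f) ? top.a / out.a : 0.0f;
    out.color = composite * top.color + (1.0f - composite) * bottom.color;
    return out;
}
```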
Bump.
You don't need to use actual assembly. SIMD instructions are available through intrinsic functions.

The following example should give you an idea. It uses MMX intrinsics to do an (admittedly imperfect) alpha blend using a separate alpha map.
You'll probably want SSE/SSE2 instructions when working with floating-point numbers, or just SSE2 if you stick with integers, since they operate on 128 bits at a time instead of just 64.

#define __x86_64__
//the silly define is only necessary when using MinGW; gcc 4.4 can do without
#include <mmintrin.h>
#undef __x86_64__
#include <cstdint> //for uint64_t
...
for (int y=top;y<bottom;y++)
{
  uint64_t* ll_screen=...;
  uint64_t* ll_bitmap=...;
  uint64_t* ll_alphamap=...;
  const __m64 zero=_mm_set_pi64x(0);
  const __m64 thfs=_mm_set_pi16(0,256,256,256);
  const uint64_t tc64=static_cast<uint64_t>(transparentColor)<<32|static_cast<uint64_t>(transparentColor);
  uint64_t cpix;
  for (int x=left;x<right;x+=2)
  {
    cpix=*ll_bitmap++;
    if (cpix!=tc64)
    {
      __m64 bmpalpha=_mm_set_pi64x(*(ll_alphamap++)); //load alpha for two pixels
      bmpalpha=_mm_unpacklo_pi8(bmpalpha,zero);       //widen alpha bytes to words
      __m64 scralpha=_mm_sub_pi16(thfs,bmpalpha);     //screen weight: 256-alpha
      __m64 src=_mm_set_pi64x(cpix);
      __m64 dest=_mm_set_pi64x(*ll_screen);
      __m64 px1=_mm_unpacklo_pi8(src,zero);           //widen source bytes to words
      __m64 px2=_mm_unpackhi_pi8(src,zero);
      px1=_mm_mullo_pi16(px1,bmpalpha);               //source*alpha
      px2=_mm_mullo_pi16(px2,bmpalpha);
      __m64 dpx1=_mm_unpacklo_pi8(dest,zero);
      __m64 dpx2=_mm_unpackhi_pi8(dest,zero);
      dpx1=_mm_mullo_pi16(dpx1,scralpha);             //screen*(256-alpha)
      dpx2=_mm_mullo_pi16(dpx2,scralpha);
      px1=_mm_add_pi16(px1,dpx1);
      px2=_mm_add_pi16(px2,dpx2);
      px1=_mm_srli_pi16(px1,8);                       //divide by 256
      px2=_mm_srli_pi16(px2,8);
      __m64 res=_mm_packs_pu16(px1,px2);              //pack back to bytes
      *(ll_screen++)=_mm_cvtsi64_si64x(res);
    }
    else ++ll_screen,++ll_alphamap;
  }
}
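If you do end up on SSE2, the per-pixel weighted sum from the loop above looks roughly like this with the 128-bit (_epi-suffixed) intrinsics. This is only a sketch for a single pixel, and the function name is mine:

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>
#include <cassert>

// Blend one 32-bit RGBA pixel: result = (src*a + dst*(256-a)) >> 8 per byte,
// with the weight a in [0,256].
uint32_t blend_px(uint32_t src, uint32_t dst, int a)
{
    const __m128i zero = _mm_setzero_si128();
    // Widen the four bytes of each pixel to words.
    __m128i s = _mm_unpacklo_epi8(_mm_cvtsi32_si128(static_cast<int>(src)), zero);
    __m128i d = _mm_unpacklo_epi8(_mm_cvtsi32_si128(static_cast<int>(dst)), zero);
    __m128i va = _mm_set1_epi16(static_cast<short>(a));
    __m128i vb = _mm_set1_epi16(static_cast<short>(256 - a));
    __m128i r = _mm_add_epi16(_mm_mullo_epi16(s, va), _mm_mullo_epi16(d, vb));
    r = _mm_srli_epi16(r, 8);       // divide by 256
    r = _mm_packus_epi16(r, zero);  // words -> bytes with saturation
    return static_cast<uint32_t>(_mm_cvtsi128_si32(r));
}
```

The structure is the same as in the MMX version: widen to words, multiply by the weights, add, shift right by 8, and pack back down.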
I'm not really comfortable with intrinsics. Besides, I want to use this chance to try writing some Assembly.
Still, the provided intrinsic functions are the way to go when you want to use SIMD instructions.
Using assembly should be left for when it actually makes sense or when there is no other way.

Edit: not only are the intrinsics much easier to use, but this will also allow the compiler to reorder the instructions into a more favorable order, producing potentially faster code.
To be honest, I tried compiling code using intrinsics, and VC++ rejected it no matter what I did. xmmintrin.h didn't help at all.
I assume you need to tell VC++ to allow the MMX instruction set.
In gcc this is done by activating optimization for Pentium MMX or a newer CPU that supports MMX.
Okay, I think I'm going with intrinsics. I only have one problem: I feel I'm wasting too much time and space converting RGBA8 to RGBA16. I need to do this because the narrowest SSE2 multiplication opcode (PMULLW) operates on words. Is there some way I can trick the CPU into multiplying bytes?
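SSE2 itself has no byte multiply, so the usual answer is the word-widening already shown above. One way to avoid the full unpack/pack round trip, though, is to multiply the even and odd bytes with separate word multiplies plus masks. A sketch (the helper name is mine):

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>
#include <cassert>

// Multiply every byte of v by a (0..256), keeping (byte*a)>>8 per byte,
// without widening to a second register pair first.
__m128i mul_bytes(__m128i v, int a)
{
    const __m128i lomask = _mm_set1_epi16(0x00FF);
    const __m128i va = _mm_set1_epi16(static_cast<short>(a));
    // Even bytes sit in the low half of each word; the product's high byte
    // is the result, so shift it down into place.
    __m128i even = _mm_srli_epi16(_mm_mullo_epi16(_mm_and_si128(v, lomask), va), 8);
    // Odd bytes: shift them down, multiply, and keep the product's high byte
    // where the odd byte belongs.
    __m128i odd = _mm_mullo_epi16(_mm_srli_epi16(v, 8), va);
    odd = _mm_andnot_si128(lomask, odd);
    return _mm_or_si128(even, odd);
}
```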
Topic archived. No new replies allowed.