I'm not sure what exactly makes transferring data from VM memory to native memory so slow. This isn't a problem in normal Java, so it should be possible to get it up to speed in Dalvik too, and maybe a JIT will help there.
Right now, optimizing this is one big trial-and-error game. I used to do the transfers put by put (one put() call per value), which was pretty slow. So I changed that to put large float[] arrays in one call instead, which was faster. When optimizing the blitting, I did the same thing, i.e. I replaced the single puts with normal float[] access and one final put at the end...and the result was slower!? I then started to use indexed geometry instead and suddenly, the float[] method was faster again!?
And that's just on my hardware. I've no idea what might be faster on other hardware.
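As a rough illustration of the two approaches (single puts versus filling a float[] and doing one bulk put), here is a minimal sketch in plain Java. The class name `PutComparison` and its methods are hypothetical and not taken from the actual code; it only uses the standard `java.nio` buffer calls. The timing loop is crude and assumes nothing about the VM, so the numbers are a hint at best and will differ between desktop Java, Dalvik and individual devices.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class PutComparison {

    // Allocates a direct float buffer, i.e. one backed by native memory.
    static FloatBuffer allocate(int floats) {
        return ByteBuffer.allocateDirect(floats * 4) // 4 bytes per float
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    // Variant 1: one put() call per value.
    static void putSingle(FloatBuffer dst, float[] src) {
        dst.clear();
        for (int i = 0; i < src.length; i++) {
            dst.put(src[i]);
        }
        dst.flip();
    }

    // Variant 2: the data is already in a float[] and gets copied in one bulk put.
    static void putBulk(FloatBuffer dst, float[] src) {
        dst.clear();
        dst.put(src, 0, src.length);
        dst.flip();
    }

    public static void main(String[] args) {
        int count = 30000;  // arbitrary size, picked for illustration only
        int rounds = 200;   // arbitrary repeat count
        float[] data = new float[count];
        for (int i = 0; i < count; i++) {
            data[i] = i * 0.5f;
        }
        FloatBuffer buffer = allocate(count);

        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            putSingle(buffer, data);
        }
        long single = System.nanoTime() - start;

        start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            putBulk(buffer, data);
        }
        long bulk = System.nanoTime() - start;

        System.out.println("single puts: " + (single / 1000000) + " ms");
        System.out.println("bulk put:    " + (bulk / 1000000) + " ms");
    }
}
```

Both variants leave the same content in the buffer, so they can be swapped freely while measuring. Which one wins has to be measured per device, as described above.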