Making columnwise operations faster?

Suppose I have two 2-dimensional waves.

make/o/n=(1000,16) w1= gnoise(1) ,w2 = gnoise(1)

Each column of each wave holds data from one iteration of the same experiment, and the column index identifies the iteration.
That is, w1[][k] and w2[][k] are two outputs of a single run, where k indexes the iteration of the experiment.
I want to construct a wave as follows:

duplicate/o w1 w3
w3[][]=w1[p][q]*w2[999-p][q]

Is there a way to make the construction of w3 faster?
Each column is independent, so can we use multiple cores or processes (or anything else) to speed up the construction of w3?


Here's an example:
Function test()
    make/o/n=(1000,16) w1= gnoise(1) ,w2 = gnoise(1)
    duplicate/o w1 w3
    Variable start = StopMSTimer(-2)
    w3[][]=w1[p][q]*w2[999-p][q]
    print (StopMSTimer(-2)-start)
End

Function test_MT()
    make/o/n=(1000,16) w1= gnoise(1) ,w2 = gnoise(1)
    duplicate/o w1 w3
    Variable start = StopMSTimer(-2)
    MultiThread w3[][]=w1[p][q]*w2[999-p][q]
    print (StopMSTimer(-2)-start)
End


When executed on the command line on the same Windows machine with a quad-core processor (StopMSTimer(-2) counts microseconds, so the printed values are in microseconds):
•test()
  3180.63
•test_MT()
  763.69

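A single StopMSTimer(-2) measurement of an operation this short tends to be noisy, so averaging over many passes may give steadier numbers. A minimal sketch, assuming the same wave sizes as above; the function name bench_MT and the parameter nReps are made up for illustration:

Function bench_MT(nReps)
    Variable nReps                              // hypothetical: number of passes to average over
    Make/O/N=(1000,16) w1, w2
    w1 = gnoise(1)
    w2 = gnoise(1)
    Duplicate/O w1, w3
    Variable i
    Variable start = StopMSTimer(-2)            // microsecond counter
    for(i = 0; i < nReps; i += 1)
        MultiThread w3[][] = w1[p][q] * w2[999-p][q]
    endfor
    Print (StopMSTimer(-2) - start) / nReps     // mean time per pass, in microseconds
End

Calling bench_MT(100) would print the mean over 100 passes; the same loop can wrap any of the other variants.
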
You could duplicate w2, reverse it and then use MatrixOp to multiply.
Function test_RV()
    make/o/n=(1000,16) w1= gnoise(1) ,w2 = gnoise(1)
    duplicate/o w2 w4
    Variable start = StopMSTimer(-2)
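    // Caution: plain Reverse treats a 2D wave as one long 1D run of points,
    // so it flips the column order as well as the rows. To flip rows only,
    // Reverse/DIM=0 w4 should work in Igor 7 or later.
    Reverse w4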
    MatrixOp/o w3 = w1 * w4
    print (StopMSTimer(-2)-start)
End

On my machine:
•test()
  1452.05
•test_MT()
  839.983
•test_RV()
  172.87
In Igor 7 you can use the MatrixOp function reverseCols, which reverses the row order within each column, so no duplicate of w2 is needed:

Function test_RV2()
    make/o/n=(1000,16) w1= gnoise(1) ,w2 = gnoise(1)
    Variable start = StopMSTimer(-2)
    MatrixOp/o w3 = w1 * reverseCols(w2)
    print (StopMSTimer(-2)-start)
End


•test()
1082.05
•test_MT()
492.495
•test_RV()
169.394
•test_RV2()
79.089
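
Before relying on the faster version, it may be worth confirming that it gives the same result as the plain wave assignment. A minimal sketch, assuming Igor 7 or later for reverseCols; the names check_RV2, wRef and wFast are made up for illustration, and double-precision waves are used so the comparison is not affected by rounding to single precision:

Function check_RV2()
    Make/O/D/N=(1000,16) w1, w2
    w1 = gnoise(1)
    w2 = gnoise(1)
    Duplicate/O w1, wRef
    Duplicate/O w1, wFast
    wRef = w1[p][q] * w2[999-p][q]              // reference: plain wave assignment
    MatrixOp/O wFast = w1 * reverseCols(w2)     // elementwise product with w2 flipped row-wise
    Print EqualWaves(wRef, wFast, 1)            // selector 1 compares wave data only
End

This should print 1 if the two results agree and 0 otherwise.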