Replace duplicates with average values: remove data points and then insert?

Hello, I have time series measurements with occasional duplicate timestamps. There are about 49,000 data points and about 2,000 duplicate pairs. I can find where the duplicates are, but I don't know how to average each pair of duplicate measurements, remove one of the two rows, and replace the remaining row with the averaged value. I tried DeletePoints, but the point numbers change after each removal, which makes it hard to identify the next pair of duplicates. Is there a good solution to this problem, other than doing arithmetic on the point number, such as offsetting it by -2 in the loop after each removal? Can I avoid deleting points and use another approach? Thank you!

HJDrescher

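A fairly good way around the indexing problem (while still using DeletePoints) is to go backwards through the index (N-1 ... 0 rather than 0 ... N-1). Deleting a point then only shifts points you have already processed, so the indices still to be visited stay valid. It can also be slightly faster (less data to move once something is deleted), depending on the data structure.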

Are the timestamps monotonic (I would guess so, but better be sure)? Are there triplets?
If the timestamps are monotonic, I'd probably crawl through the data from the last point to the first and check whether the 'next' (actually previous) one has the same timestamp. If so, average them, store the result in the higher-index point, and delete the lower-index one. Repeat until you reach index 0. Be careful with multiplets here: averaging pairwise would weight the points of a triplet unequally.
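The backward crawl described above could be sketched roughly as follows. This is only an illustration, not code from the thread: the function name AverageDuplicatesBackward and the running-count bookkeeping are assumptions added here so that triplets and longer runs get an equal-weight mean. It assumes duplicate timestamps are adjacent (for example, the X wave is sorted) and it modifies both waves in place.

```igor
// Hypothetical helper (not from the thread): averages adjacent duplicate X values in place.
Function AverageDuplicatesBackward(xWave, yWave)
	Wave xWave, yWave

	Variable i
	Variable count = 1		// Number of points already merged into the point at index i
	for(i = numpnts(xWave) - 1; i > 0; i -= 1)
		if (xWave[i-1] == xWave[i])
			// Fold point i-1 into the running mean stored at the higher index i
			yWave[i] = (yWave[i] * count + yWave[i-1]) / (count + 1)
			count += 1
			// Delete the lower-index point; the merged point moves down to i-1,
			// which is exactly where the loop looks next
			DeletePoints i-1, 1, xWave, yWave
		else
			count = 1		// A new run of X values starts at i-1
		endif
	endfor

	return numpnts(xWave)	// Number of points remaining
End
```

Because the loop only ever deletes the point just below the current index, no index arithmetic is needed after a deletion. Each DeletePoints call still shifts the points above it, so with many duplicates in a very large wave this may be slower than a single-pass copy.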

HJ

April 29, 2018 at 01:45 pm - Permalink

mwpro

Thank you HJ! It surely worked!

April 29, 2018 at 03:50 pm - Permalink

jjweimer

I have to wonder whether a method exists to avoid looping (backwards). Perhaps a clever combination of FindDuplicates, with the resultant wave + source wave blended through MatrixOP using implicit indexing or in-line logical testing?

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAH

April 30, 2018 at 12:47 pm - Permalink

Igor

You may want to take a look at FindDuplicates (requires Igor Pro 7).

April 30, 2018 at 01:01 pm - Permalink

hrodstein

Each time you call DeletePoints in a loop, all points after the deleted point must be moved in memory. If you are dealing with a large wave and have to do a large number of deletions, that will be slow.

Another approach, which may or may not be faster for a given set of input data, is to loop through the entire input dataset and copy the required data from a pair of input waves to a pair of output waves. Here is a function that I wrote to do this. It does not use DeletePoints but instead calls Redimension once. I have tested it somewhat but don't claim it to be foolproof. It also should work for more than two consecutive identical X values but I have not tested that.


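// Averages the Y values of consecutive points that share the same X value and
// keeps one point per distinct X. Overwrites xWave and yWave in place.
// Assumes duplicate X values are adjacent. Returns the number of output points.
Function RemoveDuplicatesXY(xWave, yWave)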
	Wave xWave, yWave
	
	Duplicate/FREE xWave, xWaveCopy
	Duplicate/FREE yWave, yWaveCopy
	
	Variable numPointsIn = numpnts(xWave)
	if (numPointsIn == 0)
		return 0		// Nothing to do for empty waves
	endif
	Variable numPointsOut = 0
	Variable previousX = xWaveCopy[0]
	Variable previousY = yWaveCopy[0]
	Variable currentX, currentY
	Variable numPointsWithThisX = 1	
	Variable sumOfYValuesWithThisX = previousY
	Variable i
	for(i=1; i<numPointsIn; i+=1)		// Point 0 already seeded the running sum, so start at 1
		currentX = xWaveCopy[i]
		currentY = yWaveCopy[i]		
		if (currentX != previousX)
			xWave[numPointsOut] = previousX
			yWave[numPointsOut] = sumOfYValuesWithThisX / numPointsWithThisX
			numPointsOut += 1
			numPointsWithThisX = 0
			sumOfYValuesWithThisX = 0
		endif
		numPointsWithThisX += 1
		sumOfYValuesWithThisX += currentY
		previousX = currentX	
		previousY = currentY	
	endfor

	// Write out the final group. After the loop, previousX and the running
	// sum and count always describe the last run of identical X values,
	// including the case of a single-point input where the loop never runs.
	xWave[numPointsOut] = previousX
	yWave[numPointsOut] = sumOfYValuesWithThisX / numPointsWithThisX
	numPointsOut += 1
	
	Redimension/N=(numPointsOut) xWave, yWave
	
	// For debugging only
	#if 0
		Printf "Number of input points=%d, number of output points=%d\r", numPointsIn, numPointsOut
	#endif
	
	return numPointsOut
End

I am attaching the experiment that I used to test this function. The experiment includes another function for timing how long the function takes on a particular XY pair.
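For a quick check without the attachment, a small demo along these lines may help. It is a sketch, not part of the attached experiment: the function name DemoRemoveDuplicatesXY and the wave names demoX and demoY are made up for illustration, and the data are invented to include a pair and a triplet of identical X values.

```igor
// Hypothetical demo (not from the attached experiment)
Function DemoRemoveDuplicatesXY()
	// Invented test data: X = 1 occurs twice, X = 3 occurs three times
	Make/O/D demoX = {0, 1, 1, 2, 3, 3, 3, 4}
	Make/O/D demoY = {10, 20, 30, 40, 50, 60, 70, 80}

	Variable numOut = RemoveDuplicatesXY(demoX, demoY)

	// Expect 5 points: X = 0, 1, 2, 3, 4 and Y = 10, 25, 40, 60, 80
	Print numOut
	Print demoX
	Print demoY
End
```

The demo creates its waves in the current data folder and overwrites any existing waves with those names, so run it in a scratch experiment or data folder.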

April 30, 2018 at 05:50 pm - Permalink

hrodstein

The file I attached had some unintended junk in it. Here is a cleaned up version.

Attachments RemoveDuplicatesXY Demo_0.pxp (10.84 KB)

April 30, 2018 at 05:53 pm - Permalink