Working with large datasets

Dear All,

I have been working on a data analysis routine which reads data from *.csv files, performs several statistical calculations on it and generates some easy-to-understand scientific outputs. One part of the routine handles 3D data containing many zeros and/or NaNs. At the end I replace zeros with NaNs and exclude the NaNs using the WaveTransform operation. However, the data is so large that my computer doesn't have enough memory to keep hundreds of 3D waves with thousands of entries each. Could someone have a look at the code and give some programming clues which might solve this issue? I know it seems complicated and confusing. I would appreciate any helpful comment.

Cheers,
Emre

// This part chooses csv files one by one and calls another function to analyse them.
Function MyFunctionforFIA()
	if(DataFolderExists("root:RDDataFolder")==1)
		SetDataFolder root:RDDataFolder
	else
		print "Error! Please first load some data or press reset to delete previous data"
		abort
	endif
	
	// Create sizeBinList and use in AnalyseWIBSFiles()
	// make sizeBinList
	Make/O size_bins1 = {0.5,0.5314,0.5647,0.6001,0.6377,0.6776,0.7201,0.7652,0.8132,0.8642,0.9183,0.9759,1.0371,1.102}
	Make/O size_bins2 = {1.1711,1.2445,1.3225,1.4054,1.4935,1.5871,1.6866,1.7923,1.9046,2.024,2.1508,2.2856,2.4288,2.5811}
	Make/O size_bins3 = {2.7824,2.9147,3.0974,3.2915,3.4978,3.7171,3.95,4.1976,4.4607,4.7402,5.0373,5.353,5.6885,6.0451}
	Make/O size_bins4 = {6.4239,6.8265,7.2544,7.709,8.1922,8.7056,9.2512,9.8311,10.4472,11.102,11.7978,12.5372,13.3229,14.1579,15.0453,15.9882,16.9903,18.0551,20}
	Variable ssb1, ssb2, ssb3, ssb4, ssizebinlist
	ssb1 = numpnts(size_bins1)
	ssb2 = numpnts(size_bins2)
	ssb3 = numpnts(size_bins3)
	ssb4 = numpnts(size_bins4)
	ssizebinlist = ssb1+ssb2+ssb3+ssb4
	Make/O/N=(ssizebinlist) SizeBinList
	Concatenate/NP/O {size_bins1,size_bins2,size_bins3,size_bins4}, SizeBinList
	KillWaves size_bins1, size_bins2, size_bins3, size_bins4
	
	SetDataFolder root:RDDataFolder
	// Choose one file to analyse
	Variable k
	String DataFolderList = ""
	for( k = 0; k < CountObjects("",4); k += 1)
		DataFolderList += GetIndexedObjName("", 4, k)+":;"
	endfor
	
	// Calculate time required to analyse one file
	Variable refnumTimer, timeElapsed, timeasprogress
	// Read strings from list and set data folder accordingly
	Variable numDataFolders = ItemsInList(DataFolderList)
	Variable i
	NVAR perc = root:perc
	Variable progress = 0
	ControlUpdate/W=toolKIT/A
	
	for(i=0; i < numDataFolders; i += 1)
		// ...
	endfor
End

// ...
	typeParamNames[0] = ...	// FL1_280 Type (Intensity[FL1_280] > Threshold && Power_280 > 200)
	typeParamNames[1] = NameOfWave(paramSet0)+";"+NameOfWave(paramSet3)	// FL2_280 Type (Intensity[FL2_280] > Threshold && Power_280 > 200)
	typeParamNames[2] = NameOfWave(paramSet1)+";"+NameOfWave(paramSet4)	// FL2_370 Type (Intensity[FL2_370] > Threshold && Power_370 > 200)
	
	// set the minimum and maximum values - if there are different numbers of criteria for different types set the column size to the biggest (in terms of # of criteria) type 
	Make /FREE /D /N=(3, 3) minCriteria
	Make /FREE /D /N=(3, 3) maxCriteria
	// set a bunch of arbitrary minimum criteria according to the parameter combinations specified in typeParamNames	
	// FL1_280 criteria
	minCriteria[0][0] = MinFailPower
	maxCriteria[0][0] = inf
	minCriteria[0][1] = Threshold1
	maxCriteria[0][1] = inf
	minCriteria[0][2] = minParSize
	maxCriteria[0][2] = inf
	// FL2_280 criteria
	minCriteria[1][0] = MinFailPower
	maxCriteria[1][0] = inf
	minCriteria[1][1] = Threshold2
	maxCriteria[1][1] = inf
	minCriteria[1][2] = minParSize
	maxCriteria[1][2] = inf
	// FL2_370 criteria
	minCriteria[2][0] = MinFailPower
	maxCriteria[2][0] = inf
	minCriteria[2][1] = Threshold3
	maxCriteria[2][1] = inf
	minCriteria[2][2] = minParSize
	maxCriteria[2][2] = inf

	// make the final output wave
	Variable nRanges = dimSize(DateTimeWave,0)
	Make /O/D/N=(nRanges, numpnts(sizeBinList)-1,3) homeDFR:SizeResFluW
	Wave outWave = homeDFR:SizeResFluW
	
	Make /FREE /N=(3) currMins // set size to the # of criteria in the largest type
	Make /FREE /N=(3) currMaxs // set size to the # of criteria in the largest type
	
	// build the output wave
	// sortARange is the function which takes several inputs and gives one output which will be analysed to find
	// particle counts, distributions, ratios, etc.
	for(i=0; i < nRanges; i += 1)
		// ...
	endfor
End

// ...
		if (currDataWave[i] < criteriaMin[layIn] || currDataWave[i] > criteriaMax[layIn])
			satisfied = 0
			//print currDataWave[i]
		endif
	endfor
	
	if (satisfied)
		FindLevel /Q/P binBoundsWave, parSizeWave[rowIn]
		if (!V_flag)
			destinationWave[rowIn][floor(V_LevelX)][layIn] = currDataWave[rowIn]
		endif
	endif
End
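In plainer terms, the helper checks a row against the criteria and, if it passes, drops the value into whichever size bin the particle size falls in. Here is a minimal sketch of just that binning idea (Python for illustration only; `bin_value` and the dictionary destination are hypothetical stand-ins, not the author's actual code):

```python
import bisect

def bin_value(bin_bounds, par_size, value, row, layer, dest):
    """Store value in the size bin that par_size falls into.
    bin_bounds must be ascending; sizes outside the range are dropped,
    mirroring the FindLevel V_flag check above."""
    if bin_bounds[0] <= par_size < bin_bounds[-1]:
        col = bisect.bisect_right(bin_bounds, par_size) - 1  # like floor(V_LevelX)
        dest[(row, col, layer)] = value

dest = {}
bounds = [0.5, 1.0, 2.0, 20.0]
bin_value(bounds, 1.5, 42.0, row=0, layer=0, dest=dest)
print(dest)  # {(0, 1, 0): 42.0}
```

`bisect_right` on a sorted boundary wave plays the role of FindLevel here: it finds the bin column in O(log n) without scanning the data.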

// This part appends all resulting data into one single wave. Output is a 3D wave which may contain millions of data points. 
// Apparently this is where I got stuck.
Function AppendFluoreWaves(parentDF, DFpattern, killChildDFs, [sortedAppend, sortOptions, modelDF])
	String parentDF, DFpattern
	Variable killChildDFs
	Variable sortedAppend
	Variable sortOptions
	String modelDF
	
// get list of childDFs matching DFpattern
	if( strlen(parentDF) == 0 )
		parentDF = GetDataFolder(1)
	endif
	String childDFs = ListMatch( ChildDFList(parentDF), DFpattern+"*" )
	if( sortedAppend )								// default is 0 = false
		childDFs = SortList(childDFs, ";", sortOptions)	// default is 0 = ascending case-sensitive alphabetic ASCII sort
	endif
	Variable numChildren = ItemsInList(childDFs)		// count the "good" children
	if( numChildren == 0 )							// ...none -- abort
		Abort "Please first load some data to analyse"
	endif
// build WList from waves in modelDF or in parentDF
	string WList, saveDF = GetDataFolder(1)
	if( !ParamIsDefault(modelDF) )	// use modelDF if specified
		SetDataFolder $modelDF
		WList = WaveList("*", ";", "")
	else
		SetDataFolder $parentDF		// default to parentDF
		WList = WaveList("*", ";", "")
	endif
	variable numWaves = ItemsInList(WList)		// count waves to append to
	if( numWaves == 0 )						// ...none -- abort
		Abort "AppendWave found no waves to append to in data folder \""+GetDataFolder(1)+"\""
	endif
 
	variable k, m, childDataAppended
	string childDF, currentWave, sourceWave, destWave
// cycle through childDFs
	for( k = 0; k < numChildren; k += 1 )
		childDF = StringFromList(k, childDFs)
		SetDataFolder $childDF
		childDataAppended = 0	// don't want to kill childDFs that contained no data to append
// ... cycle through WList
		for( m = 0; m < numWaves; m += 1 )
			currentWave = StringFromList(m, WList)
			sourceWave = childDF+currentWave
			if( exists(sourceWave) == 1 )	// not all waves in child data folder may be relevant
				destWave = parentDF+currentWave
				Concatenate/NP=0 sourceWave+";", $destWave
				childDataAppended = 1
			endif
		endfor
		if( killChildDFs && childDataAppended )	// honor the killChildDFs argument
			KillDataFolder $childDF
		endif
	endfor
	SetDataFolder $saveDF
End
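The append step itself is simple in principle: walk the child folders and concatenate each child's result wave onto the parent's, row-wise, as Concatenate/NP=0 does. A sketch of that idea (Python for illustration; the folder dictionaries and wave name are hypothetical):

```python
def append_child_waves(parent, children, wave_names):
    """Row-wise append each child's named arrays onto the parent's,
    like Igor's Concatenate/NP=0 across child data folders."""
    for child in children:
        for name in wave_names:
            if name not in child:
                continue  # not all waves exist in every child folder
            parent.setdefault(name, []).extend(child[name])  # append rows
    return parent

parent = {}
children = [{"SizeResFluW": [[1, 2], [3, 4]]}, {"SizeResFluW": [[5, 6]]}]
append_child_waves(parent, children, ["SizeResFluW"])
print(parent["SizeResFluW"])  # [[1, 2], [3, 4], [5, 6]]
```

Note that this doubles peak memory while a child's data coexists with the growing parent wave, which is exactly where the out-of-memory pressure shows up.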

// There is also one another function which deals with statistical analysis by using the main output wave (3D Wave)
tooprock wrote: Dear All,
However the data is so large that my computer doesn't have enough memory to keep hundreds of 3D waves with thousands of entries each.


Does that mean you get an "out of memory" error message or what are the symptoms?
How large are those 3D waves? Are they double precision? Could they be in single precision?
Does the routine work when you use a much smaller subset of the data?

The 3-D matrices are created in fluoreSizeBin(...). Why again do you want to turn them back into 1-D waves with AppendFluoreWaves(...)?

Where are you replacing 0's by NaN's ... I don't see that in your code? Have you thought about first transforming the 0's to NaN's in the CSV files using a text-editor?

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAHuntsville
Hello,

First of all, thanks for the comments. I will try to explain why and how I do those calculations. Firstly, the routine works great for small datasets.

1) I get an "out of memory error" when I run the routine with large datasets.

2) I can't replace the zeros with NaNs in the csv files because those zeros and/or NaNs are the output of several calculations. That is, the original data may or may not contain zeros and/or NaNs. However, this is not directly related to the output data, which is a product of several csv files.

3) I don't convert them from 3-D matrices into 1-D waves. I just append the separate outputs (one resulting wave from each csv file) into a single wave. However, I have to decompose the 3-D wave into 1-D waves to be able to use WaveStats, StatsQuantiles or mean; those functions work only on 1-D waves.

4) Only the date/time waves have to be in double precision; the others can be single precision.

5) I replace zeros with NaNs in a separate function which I didn't include here. That function handles only the statistical analysis.

6) With "large" I mean a 3-D matrix of almost 10 million rows, 60 columns and 3 layers. There are also single waves having almost the same number of data points. When I analyse only 60 csv files (around 500,000 rows, 60 columns, 3 layers) the total memory used reaches 3.4 GB.

So, you suggest using single precision waves instead of double precision to save some memory.

Some more information concerning how the entire toolkit runs:

i) The user chooses one or more csv files. These are read and saved into separate data folders. In each data folder, the data points read from the csv file are saved in separate waves (each column, or let's say parameter, as a new wave).
ii) The routine works on each data folder one by one and the results are written into the current data folder to prevent overwriting.
iii) After that, the resulting waves from the separate data folders are appended into single waves and the remainder (the data folders containing the raw files) is deleted.

Cheers,
Emre
tooprock wrote: Hello,
So, you suggest using single precision waves instead of double precision to save some memory.


Well, this produces a wave of 360 MB


Make/N=(500000, 60, 3) single


and this of 720 MB


Make/D/N=(500000, 60, 3) double


Depending on what the data are and the precision you need, that could be a possibility. Maybe even integer waves... it really depends on what you are working on. Otherwise have a look at


displayhelptopic "Memory management"


tooprock wrote: 6) With "large" I mean a 3-D matrix of almost 10 million rows, 60 columns and 3 layers. There are also single waves having almost the same number of data points. When I analyse only 60 csv files (around 500,000 rows, 60 columns, 3 layers) the total memory used reaches 3.4 GB.

That's 1.8e9 data points. If it is single precision data, that's 7.2e9 bytes, which is more than twice the address space available to a 32-bit program. And your sequence of steps produces several waves and copies of waves. Within the 3-GByte address space available to a 32-bit program on 32-bit Windows, you also have to load Igor itself, have memory for graphs, etc.

If you have a 64-bit Windows, you have 4 GBytes available, which still isn't enough. But if you have 64-bit Windows you could use Igor64. Go to http://www.wavemetrics.net/ and look for the link to download the Igor64 installer. Be sure to read the notes about installing Igor64.

John Weeks
WaveMetrics, Inc.
support@wavemetrics.com
Using 64-bit is also not a good solution because after some point the process slows down and takes very long to analyse even a small dataset. So, I wrote another function to work through the data in a different way. I am not really sure if this is statistically meaningful, but it works pretty well. Thanks for all the suggestions.

Cheers,
Emre
I sometimes work on big datasets and in order to get statistics from those datasets I normally chunk the reading.

For example, say you are interested in the mean and standard deviation of the data:


1) open the file using the IGOR open operation.
2) read in e.g. 10000 lines of data and parse into values
3) For the variables you are interested in calculate the number of values (should be 10000, but if you are discounting NaN/Inf it may be smaller), the sum of values and the sum of the squared values. From the 2nd iteration onwards you need to accumulate those sums.
4) goto 2) until you've read the data
5) calculate the statistics you want from the sum of the values and the sum of the squared values.

The point is that you never need to have the whole dataset in memory, just a subset.
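The accumulation in steps 2) and 3) can be sketched like this (Python for illustration; in Igor the same running sums would live in variables updated once per chunk):

```python
import math

def chunked_mean_std(chunks):
    """Accumulate count, sum and sum of squares over data chunks,
    skipping NaN/Inf, then derive the mean and sample standard deviation."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for chunk in chunks:            # each chunk is e.g. 10000 parsed values
        for x in chunk:
            if math.isfinite(x):    # discount NaN/Inf, as in step 3
                n += 1
                total += x
                total_sq += x * x
    mean = total / n
    var = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(var)

# two "file chunks" that together hold 1..6 plus a NaN to be skipped
print(chunked_mean_std([[1.0, 2.0, 3.0, float("nan")], [4.0, 5.0, 6.0]]))
```

One caveat: the sum-of-squares form can lose precision on very large, nearly constant data; Welford's online algorithm is a numerically safer variant of the same chunked idea.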
...doh, Andy beat me to it; I was going to offer the suggestion of parsing the main file into smaller subsets and combining the results. Also, ImageStats has the same kind of functionality as WaveStats and it will work on individual planes of 3D waves, so you can perhaps avoid some of the extra computation there.
Correct me please if I'm wrong. ImageStats works for multidimensional waves but doesn't calculate V_avg, V_min, etc. for each column, only for the entire matrix. Therefore I decompose a matrix into several 1-D waves (one wave per column) and use WaveStats to find the statistical parameters for each single column and save them in a separate wave. However, using a small subset of the data may help solve the problem. Thanks all.
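For comparison, the per-column reduction described above can be done without per-column copies if the tool supports axis-wise reductions. In NumPy terms (illustration only; the array is a hypothetical stand-in for one layer of the 3D wave):

```python
import numpy as np

# hypothetical stand-in for one layer: rows x columns, with a NaN to exclude
layer = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0]])

col_mean = np.nanmean(layer, axis=0)  # per-column mean, NaNs excluded
col_min = np.nanmin(layer, axis=0)    # per-column minimum
col_max = np.nanmax(layer, axis=0)    # per-column maximum
print(col_mean.tolist())  # [2.0, 20.0]
```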
tooprock wrote: Correct me please if I'm wrong. ImageStats works for multidimensional waves but doesn't calculate V_avg, V_min, etc. for each column, only for the entire matrix. Therefore I decompose a matrix into several 1-D waves (one wave per column) and use WaveStats to find the statistical parameters for each single column and save them in a separate wave. However, using a small subset of the data may help solve the problem. Thanks all.


ImageStats is not ideal for per column calculations. Depending on the stats you need it may be far more efficient to use WaveStats with /R=[...] and possibly with /M=1.

A.G.
WaveMetrics, Inc.
ImageStats [/M=val /P=planeNumber] /G={startP, endP, startQ, endQ} imageWave

lets you pick a plane and then restrict your region of interest to a subset of that plane. All the typical values are returned for this ROI, as with WaveStats, and you can give it /M=1 if you are only interested in V_avg and the local max and min. I just mentioned it because you can use it straightforwardly rather than copying large sections of your data out, which seemed to be one of your problems regarding memory usage.
daggaz wrote: ImageStats [/M=val /P=planeNumber] /G={startP, endP, startQ, endQ} imageWave

lets you pick a plane and then restrict your region of interest to a subset of that plane. All the typical values are returned for this ROI, as with WaveStats, and you can give it /M=1 if you are only interested in V_avg and the local max and min. I just mentioned it because you can use it straightforwardly rather than copying large sections of your data out, which seemed to be one of your problems regarding memory usage.


I am aware of this possibility (I wrote it). I'm actually impressed that you found this relatively obscure feature. I still think that WaveStats/M=1 would execute faster. FWIW, this ImageStats feature was added to support ImageStats over rectangular areas without using ROI. In general, using ROI is less efficient because each pixel in the ROI wave is scanned at least once.

A.G.
WaveMetrics, Inc.
Igor wrote: FWIW, this ImageStats feature was added to support ImageStats over rectangular areas without using ROI. In general, using ROI is less efficient because each pixel in the ROI wave is scanned at least once.


So using the /G flag won't scan the pixels more than once, or? Seems good. I haven't used it on enormous data sets, but on large tiff files it seems to execute as quickly as WaveStats, and at least now he wouldn't have to copy the data into a suitable 1-D format for WaveStats.

At any rate, I didn't realize this was obscure. WaveStats is for 1-D waves and you get sent right to ImageStats in the documentation for higher dimensionalities, and this command is right in the middle and seems to emulate WaveStats, which is probably why a lot of users got sent there in the first place, trying to use WaveStats on the wrong kind of wave. Glad you put it in there!
daggaz wrote: At any rate, I didn't realize this was obscure.


You can only find this feature if you read the documentation -- it is not available as a menu or dialog selection. Even if you come across this flag's documentation it is not immediately obvious why it is more efficient than using an ROI.