Working with large datasets

Dear All,

I have been working on a data analysis routine which reads data from *.csv files, performs several statistical calculations on it and generates some easy-to-understand scientific outputs. One part of the routine handles 3D data containing many zeros and/or NaNs. At the end I replace zeros with NaNs and exclude the NaNs using the WaveTransform operation. However, the data is so large that my computer doesn't have enough memory to keep hundreds of 3D waves with thousands of entries each. Could someone have a look at the code and give some programming clues which might solve this issue? I know it seems complicated and confusing. I would appreciate any helpful comment.

Cheers,
Emre

// This part chooses csv files one by one and calls another function to analyse them.
Function MyFunctionforFIA()
	if(DataFolderExists("root:RDDataFolder")==1)
		SetDataFolder root:RDDataFolder
	else
		print "Error! Please first load some data or press reset to delete previous data"
		abort
	endif
	
	// Create sizeBinList and use in AnalyseWIBSFiles()
	// make sizeBinList
	Make/O size_bins1 = {0.5,0.5314,0.5647,0.6001,0.6377,0.6776,0.7201,0.7652,0.8132,0.8642,0.9183,0.9759,1.0371,1.102}
	Make/O size_bins2 = {1.1711,1.2445,1.3225,1.4054,1.4935,1.5871,1.6866,1.7923,1.9046,2.024,2.1508,2.2856,2.4288,2.5811}
	Make/O size_bins3 = {2.7824,2.9147,3.0974,3.2915,3.4978,3.7171,3.95,4.1976,4.4607,4.7402,5.0373,5.353,5.6885,6.0451}
	Make/O size_bins4 = {6.4239,6.8265,7.2544,7.709,8.1922,8.7056,9.2512,9.8311,10.4472,11.102,11.7978,12.5372,13.3229,14.1579,15.0453,15.9882,16.9903,18.0551,20}
	Variable ssb1, ssb2, ssb3, ssb4, ssizebinlist
	ssb1 = numpnts(size_bins1)
	ssb2 = numpnts(size_bins2)
	ssb3 = numpnts(size_bins3)
	ssb4 = numpnts(size_bins4)
	ssizebinlist = ssb1+ssb2+ssb3+ssb4
	Make/O/N=(ssizebinlist) SizeBinList
	Concatenate/NP/O {size_bins1,size_bins2,size_bins3,size_bins4}, SizeBinList
	KillWaves size_bins1, size_bins2, size_bins3, size_bins4
	
	SetDataFolder root:RDDataFolder
	// Choose one file to analyse
	Variable k
	String DataFolderList = ""
	for( k = 0; k < CountObjects("",4); k += 1)
		DataFolderList += GetIndexedObjName("", 4, k)+":;"
	endfor
	
	// Calculate time required to analyse one file
	Variable refnumTimer, timeElapsed, timeasprogress
	// Read strings from list and set data folder accordingly
	Variable numDataFolders = ItemsInList(DataFolderList)
	Variable i
	NVAR perc = root:perc
	Variable progress = 0
	ControlUpdate/W=toolKIT/A
	
	for(i=0; i < numDataFolders; i += 1)
		// ...
	endfor
End

// ...
	typeParamNames[0] = ...	// FL1_280 Type (Intensity[FL1_280] > Threshold && Power_280 > 200)
	typeParamNames[1] = NameOfWave(paramSet0)+";"+NameOfWave(paramSet3)	// FL2_280 Type (Intensity[FL2_280] > Threshold && Power_280 > 200)
	typeParamNames[2] = NameOfWave(paramSet1)+";"+NameOfWave(paramSet4)	// FL2_370 Type (Intensity[FL2_370] > Threshold && Power_370 > 200)
	
	// set the minimum and maximum values - if there are different numbers of criteria for different types set the column size to the biggest (in terms of # of criteria) type 
	Make /FREE /D /N=(3, 3) minCriteria
	Make /FREE /D /N=(3, 3) maxCriteria
	// set a bunch of arbitrary minimum criteria according to the parameter combinations specified in typeParamNames	
	// FL1_280 criteria
	minCriteria[0][0] = MinFailPower
	maxCriteria[0][0] = inf
	minCriteria[0][1] = Threshold1
	maxCriteria[0][1] = inf
	minCriteria[0][2] = minParSize
	maxCriteria[0][2] = inf
	// FL2_280 criteria
	minCriteria[1][0] = MinFailPower
	maxCriteria[1][0] = inf
	minCriteria[1][1] = Threshold2
	maxCriteria[1][1] = inf
	minCriteria[1][2] = minParSize
	maxCriteria[1][2] = inf
	// FL2_370 criteria
	minCriteria[2][0] = MinFailPower
	maxCriteria[2][0] = inf
	minCriteria[2][1] = Threshold3
	maxCriteria[2][1] = inf
	minCriteria[2][2] = minParSize
	maxCriteria[2][2] = inf

	// make the final output wave
	Variable nRanges = dimSize(DateTimeWave,0)
	Make /O/D/N=(nRanges, numpnts(sizeBinList)-1,3) homeDFR:SizeResFluW
	Wave outWave = homeDFR:SizeResFluW
	
	Make /FREE /N=(3) currMins // set size to the # of criteria in the largest type
	Make /FREE /N=(3) currMaxs // set size to the # of criteria in the largest type
	
	// build the output wave
	// sortARange is the function which takes several inputs and gives one output which will be analysed to find
	// particle counts, distributions, ratios, etc.
	for(i=0; i < nRanges; i += 1)
		// ...
	endfor
End

// ...
		if (currDataWave[i] < criteriaMin[layIn] || currDataWave[i] > criteriaMax[layIn])
			satisfied = 0
			//print currDataWave[i]
		endif
	endfor
	
	if (satisfied)
		FindLevel /Q/P binBoundsWave, parSizeWave[rowIn]
		if (!V_flag)
			destinationWave[rowIn][floor(V_LevelX)][layIn] = currDataWave[rowIn]
		endif
	endif
End
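In plainer terms, the helper checks a row against the criteria and, if it passes, drops the value into whichever size bin the particle size falls in. Here is a minimal sketch of just that binning idea (Python for illustration only; `bin_value` and the dictionary destination are hypothetical stand-ins, not the author's actual code):

```python
import bisect

def bin_value(bin_bounds, par_size, value, row, layer, dest):
    """Store value in the size bin that par_size falls into.
    bin_bounds must be ascending; sizes outside the range are dropped,
    mirroring the FindLevel V_flag check above."""
    if bin_bounds[0] <= par_size < bin_bounds[-1]:
        col = bisect.bisect_right(bin_bounds, par_size) - 1  # like floor(V_LevelX)
        dest[(row, col, layer)] = value

dest = {}
bounds = [0.5, 1.0, 2.0, 20.0]
bin_value(bounds, 1.5, 42.0, row=0, layer=0, dest=dest)
print(dest)  # {(0, 1, 0): 42.0}
```

`bisect_right` on a sorted boundary wave plays the role of FindLevel here: it finds the bin column in O(log n) without scanning the data.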

// This part appends all resulting data into one single wave. Output is a 3D wave which may contain millions of data points. 
// Apparently this is where I got stuck.
Function AppendFluoreWaves(parentDF, DFpattern, killChildDFs, [sortedAppend, sortOptions, modelDF])
	String parentDF, DFpattern
	Variable killChildDFs
	Variable sortedAppend
	Variable sortOptions
	String modelDF
	
// get list of childDFs matching DFpattern
	if( strlen(parentDF) == 0 )
		parentDF = GetDataFolder(1)
	endif
	String childDFs = ListMatch( ChildDFList(parentDF), DFpattern+"*" )
	if( sortedAppend )								// default is 0 = false
		childDFs = SortList(childDFs, ";", sortOptions)	// default is 0 = ascending case-sensitive alphabetic ASCII sort
	endif
	Variable numChildren = ItemsInList(childDFs)		// count the "good" children
	if( numChildren == 0 )							// ...none -- abort
		Abort "Please first load some data to analyse"
	endif
// build WList from waves in modelDF or in parentDF
	string WList, saveDF = GetDataFolder(1)
	if( !ParamIsDefault(modelDF) )	// use modelDF if specified
		SetDataFolder $modelDF
		WList = WaveList("*", ";", "")
	else
		SetDataFolder $parentDF		// default to parentDF
		WList = WaveList("*", ";", "")
	endif
	variable numWaves = ItemsInList(WList)		// count waves to append to
	if( numWaves == 0 )						// ...none -- abort
		Abort "AppendWave found no waves to append to in data folder \""+GetDataFolder(1)+"\""
	endif
 
	variable k, m, childDataAppended
	string childDF, currentWave, sourceWave, destWave
// cycle through childDFs
	for( k = 0; k < numChildren; k += 1 )
		childDF = StringFromList(k, childDFs)
		SetDataFolder $childDF
		childDataAppended = 0	// don't want to kill childDFs that contained no data to append
// ... cycle through WList
		for( m = 0; m < numWaves; m += 1 )
			currentWave = StringFromList(m, WList)
			sourceWave = childDF+currentWave
			if( exists(sourceWave) == 1 )	// not all waves in child data folder may be relevant
				destWave = parentDF+currentWave
				Concatenate/NP=0 sourceWave+";", $destWave
				childDataAppended = 1
			endif
		endfor
		if( killChildDFs && childDataAppended )	// honor the killChildDFs argument
			KillDataFolder $childDF
		endif
	endfor
	SetDataFolder $saveDF
End
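The append step itself is simple in principle: walk the child folders and concatenate each child's result wave onto the parent's, row-wise, as Concatenate/NP=0 does. A sketch of that idea (Python for illustration; the folder dictionaries and wave name are hypothetical):

```python
def append_child_waves(parent, children, wave_names):
    """Row-wise append each child's named arrays onto the parent's,
    like Igor's Concatenate/NP=0 across child data folders."""
    for child in children:
        for name in wave_names:
            if name not in child:
                continue  # not all waves exist in every child folder
            parent.setdefault(name, []).extend(child[name])  # append rows
    return parent

parent = {}
children = [{"SizeResFluW": [[1, 2], [3, 4]]}, {"SizeResFluW": [[5, 6]]}]
append_child_waves(parent, children, ["SizeResFluW"])
print(parent["SizeResFluW"])  # [[1, 2], [3, 4], [5, 6]]
```

Note that this doubles peak memory while a child's data coexists with the growing parent wave, which is exactly where the out-of-memory pressure shows up.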

// There is also one another function which deals with statistical analysis by using the main output wave (3D Wave)
tooprock wrote: Dear All,
However the data is so large that my computer doesn't have enough memory to keep hundreds of 3D waves with thousands of entries each.


Does that mean you get an "out of memory" error message or what are the symptoms?
How large are those 3D waves? Are they double precision? Could they be in single precision?
Does the routine work when you use a much smaller subset of the data?

The 3-D matrices are created in fluoreSizeBin(...). Why again do you want to turn them back into 1-D waves with AppendFluoreWaves(...)?

Where are you replacing 0's by NaN's ... I don't see that in your code? Have you thought about first transforming the 0's to NaN's in the CSV files using a text-editor?

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAHuntsville
Hello,

First of all, thanks for the comments. I will try to explain why and how I do those calculations. Firstly, the routine works great for small datasets.

1) I get an "out of memory error" when I run the routine with large datasets.

2) I can't replace the zeros with NaNs in the csv files because those zeros and/or NaNs are the output of several calculations. That is, the original data may or may not contain zeros and/or NaNs. However, this is not directly related to the output data, which is a product of several csv files.

3) I don't convert them from 3-D matrices into 1-D waves. I just append the separate outputs (one resulting wave from each csv file) into a single wave. However, I have to decompose the 3-D wave into 1-D waves to be able to use WaveStats, StatsQuantiles or mean; those functions work only on 1-D waves.

4) Only the date/time waves have to be in double precision; the others can be single precision.

5) I replace zeros with NaNs in a separate function which I didn't include here. That function handles only the statistical analysis.

6) With "large" I mean a 3-D matrix of almost 10 million rows, 60 columns and 3 layers. There are also single waves having almost the same number of data points. When I analyse only 60 csv files (around 500,000 rows, 60 columns, 3 layers) the total memory used reaches 3.4 GB.

So, you suggest using single precision waves instead of double precision to save some memory.

Some more information concerning how the entire toolkit runs:

i) The user chooses one or more csv files. These are read and saved into separate data folders. In each data folder, the data points read from the csv file are saved in separate waves (each column, or let's say parameter, as a new wave).
ii) The routine works on each data folder one by one and the results are written into the current data folder to prevent overwriting.
iii) After that, the resulting waves from the separate data folders are appended into single waves and the remainder (the data folders containing the raw files) is deleted.

Cheers,
Emre
tooprock wrote: Hello,
So, you suggest using single precision waves instead of double precision to save some memory.


Well, this produces a wave of 360 MB


Make/N=(500000, 60, 3) single


and this of 720 MB


Make/D/N=(500000, 60, 3) double


Depending on what the data are and the precision you need, that could be a possibility. Maybe even integer waves... it really depends on what you are working on. Otherwise have a look at


displayhelptopic "Memory management"


tooprock wrote: 6) With "large" I mean a 3-D matrix of almost 10 million rows, 60 columns and 3 layers. There are also single waves having almost the same number of data points. When I analyse only 60 csv files (around 500,000 rows, 60 columns, 3 layers) the total memory used reaches 3.4 GB.

That's 1.8e9 data points. If it is single precision data, that's 7.2e9 bytes, which is more than twice the address space available to a 32-bit program. And your sequence of steps produces several waves and copies of waves. Within the 3-GByte address space available to a 32-bit program on 32-bit Windows, you also have to load Igor itself, have memory for graphs, etc.

If you have a 64-bit Windows, you have 4 GBytes available, which still isn't enough. But if you have 64-bit Windows you could use Igor64. Go to http://www.wavemetrics.net/ and look for the link to download the Igor64 installer. Be sure to read the notes about installing Igor64.

John Weeks
WaveMetrics, Inc.
support@wavemetrics.com
Using 64-bit is also not a good solution because after some point the process slows down and takes very long to analyse even a small dataset. So, I wrote another function to work through the data in a different way. I am not really sure if this is statistically meaningful, but it works pretty well. Thanks for all the suggestions.

Cheers,
Emre
I sometimes work on big datasets and in order to get statistics from those datasets I normally chunk the reading.

For example, say you are interested in the mean and standard deviation of the data:


1) open the file using the IGOR open operation.
2) read in e.g. 10000 lines of data and parse into values
3) For the variables you are interested in calculate the number of values (should be 10000, but if you are discounting NaN/Inf it may be smaller), the sum of values and the sum of the squared values. From the 2nd iteration onwards you need to accumulate those sums.
4) goto 2) until you've read the data
5) calculate the statistics you want from the sum of the values and the sum of the squared values.

The point is that you never need to have the whole dataset in memory, just a subset.
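The accumulation in steps 2) and 3) can be sketched like this (Python for illustration; in Igor the same running sums would live in variables updated once per chunk):

```python
import math

def chunked_mean_std(chunks):
    """Accumulate count, sum and sum of squares over data chunks,
    skipping NaN/Inf, then derive the mean and sample standard deviation."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for chunk in chunks:            # each chunk is e.g. 10000 parsed values
        for x in chunk:
            if math.isfinite(x):    # discount NaN/Inf, as in step 3
                n += 1
                total += x
                total_sq += x * x
    mean = total / n
    var = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(var)

# two "file chunks" that together hold 1..6 plus a NaN to be skipped
print(chunked_mean_std([[1.0, 2.0, 3.0, float("nan")], [4.0, 5.0, 6.0]]))
```

One caveat: the sum-of-squares form can lose precision on very large, nearly constant data; Welford's online algorithm is a numerically safer variant of the same chunked idea.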
...doh, Andy beat me to it; I was going to offer the suggestion of parsing the main file into smaller subsets and combining the results. Also, ImageStats has the same kind of functionality as WaveStats and it will work on individual planes of 3D waves, so you can perhaps avoid some of the extra computation there.
Correct me please if I'm wrong. ImageStats works for multidimensional waves but doesn't calculate V_avg, V_min, etc. for each column, only for the entire matrix. Therefore I decompose a matrix into several 1-D waves (one wave per column) and use WaveStats to find the statistical parameters for each single column and save them in a separate wave. However, using a small subset of the data may help solve the problem. Thanks all.
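For comparison, the per-column reduction described above can be done without per-column copies if the tool supports axis-wise reductions. In NumPy terms (illustration only; the array is a hypothetical stand-in for one layer of the 3D wave):

```python
import numpy as np

# hypothetical stand-in for one layer: rows x columns, with a NaN to exclude
layer = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0]])

col_mean = np.nanmean(layer, axis=0)  # per-column mean, NaNs excluded
col_min = np.nanmin(layer, axis=0)    # per-column minimum
col_max = np.nanmax(layer, axis=0)    # per-column maximum
print(col_mean.tolist())  # [2.0, 20.0]
```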
tooprock wrote: Correct me please if I'm wrong. ImageStats works for multidimensional waves but doesn't calculate V_avg, V_min, etc. for each column, only for the entire matrix. Therefore I decompose a matrix into several 1-D waves (one wave per column) and use WaveStats to find the statistical parameters for each single column and save them in a separate wave. However, using a small subset of the data may help solve the problem. Thanks all.


ImageStats is not ideal for per column calculations. Depending on the stats you need it may be far more efficient to use WaveStats with /R=[...] and possibly with /M=1.

A.G.
WaveMetrics, Inc.
ImageStats [/M=val /P=planeNumber] /G={startP, endP, startQ, endQ} imageWave

lets you pick a plane and then restrict your region of interest to a subset of that plane. All the typical values are returned for this ROI, as with WaveStats, and you can give it /M=1 if you are only interested in V_avg and the local max and min. I just mentioned it because you can use it straightforwardly rather than copying large sections of your data out, which seemed to be one of your problems regarding memory usage.
daggaz wrote: ImageStats [/M=val /P=planeNumber] /G={startP, endP, startQ, endQ} imageWave

lets you pick a plane and then restrict your region of interest to a subset of that plane. All the typical values are returned for this ROI, as with WaveStats, and you can give it /M=1 if you are only interested in V_avg and the local max and min. I just mentioned it because you can use it straightforwardly rather than copying large sections of your data out, which seemed to be one of your problems regarding memory usage.


I am aware of this possibility (I wrote it). I'm actually impressed that you found this relatively obscure feature. I still think that WaveStats/M=1 would execute faster. FWIW, this ImageStats feature was added to support ImageStats over rectangular areas without using ROI. In general, using ROI is less efficient because each pixel in the ROI wave is scanned at least once.

A.G.
WaveMetrics, Inc.
Igor wrote: FWIW, this ImageStats feature was added to support ImageStats over rectangular areas without using ROI. In general, using ROI is less efficient because each pixel in the ROI wave is scanned at least once.


So using the /G flag won't scan the pixels more than once, or? Seems good. I haven't used it on enormous data sets, but on large tiff files it seems to execute as quickly as WaveStats, and at least now he wouldn't have to copy the data into a suitable 1-D format for WaveStats.

At any rate, I didn't realize this was obscure. WaveStats is for 1-D waves and you get sent right to ImageStats in the documentation for higher dimensionalities, and this command is right in the middle and seems to emulate WaveStats, which is probably why a lot of users got sent there in the first place, trying to use WaveStats on the wrong kind of wave. Glad you put it in there!
daggaz wrote: At any rate, I didn't realize this was obscure.


You can only find this feature if you read the documentation -- it is not available as a menu or dialog selection. Even if you come across this flag's documentation it is not immediately obvious why it is more efficient than using an ROI.