object name length limits, metaData, hashing, and searching

Igor waves are often given names describing the data in the waves (If you are not up on the current jargon, data describing data is called metaData). Having the metaData embedded in the wave name is convenient for searching for a wave corresponding to particular data, but can lead to 2 problems: names that exceed the 32 character limit on object names in Igor "liberal" names, which must be dealt with specially. Each Igor wave has a waveNote, which is a logical place to store metaData. Finding a wave based on its wavenote, however, means looking at all the waves in the datafolder, which is slow if there are many waves, as it is a linear search. I wrote some code to use hash techniques to limit wavenames to 32 characters while still providing quick retrieval of a wave identified by associated metadata, which is stored in the waveNote. In general, Hash maps offer data storage with quick insertion of key:value pairs and quick retrieval of the value for a given key. A hash map has a set of key:value pairs in an array. The position of each pair in the array is based on a hash of the key. The size of the array can be much less than the size of the number of POSSIBLE keys, but must, of course, be at least as large as the number of ACTUAL key:value pairs. Similarly, the number of POSSIBLE meta-data based wavenames exceeds Igor's 32 character limits, but the number of ACTUAL waves in a data folder is not likely to exceed the number of 32 character long wavenames available. Thus, long strings of metaData can be hashed to shorter strings which are used as wavenames, while still allowing waves containing a particular combination of metaData to be found more quickly than with a linear search. The analogy between a standard hash map and the hash of Igor waves: Standard Hash Map--->Igor WaveName Hash Key--------------------->metaData (stored in waveNote) value ------------------->the wave itself, or a reference to it array position---------->the name of the wave, where the name is in base37 using the legal Igor name characters (0 -9, a-z, _) Some of the "user-facing" functions are described below: //****************************************************************************************************** // Makes a wave named by a hash of the metaData string, puts the metadata in the waveNote, and returns a reference // to that wave. If overWrite parameter is 0 and a wave already exists, returns a null reference and does not overwrite wave function/Wave hashMakeWave (hashFolder, metaDataString, overWrite, wType, pSize, qSize, rSize, sSize, hashSize) string hashFolder // path to the datafolder, the datafolder will be made if it does not exist string metaDataString // string contaning unique metadata for this wave variable overWrite // 1 to overwrite wave if it exists, 0 to not overwrite an existing wave variable wType // wavemetrics code for wavetype. 0 means a text Wave, 2 means 33 bit float, etc variable pSize, qSize, rSize, sSize // size of dimensions for wave. use 0 for unused dimensions variable hashSize // only used if making folder for the first time //****************************************************************************************************** // renames/moves an existing wave to become part of the hash system function hashAddWave (theWave, hashFolder, metaDataString, overWrite, hashSize) wave theWave string hashFolder // path to the datafolder, the datafolder will be made if it does not exist string metaDataString // string contaning unique metadata for this wave variable overWrite // 1 to overwrite wave if it exists, 0 to not overwrite an existing wave variable hashSize // only used if making folder for the first time //****************************************************************************************************** // Gets a wave reference to a hashed wave based on metadata, or returns a null reference if the wave can not be found function/Wave hashGetWave (hashFolder, metaDataStr) string hashFolder string metaDataStr Now this is fine for finding a wave based on all the meta-data used to make the hash. It kind of falls down when trying to get all of the waves that match for a certain element of the metadata. Say we want to retrieve all waves with idCode 75 where we have metadata for date, idCode, and treatment. In the case where the waveNames are made directly from the metadata, we could use something like WaveList("*75*", ";", "" ). But with waveNames hashed on all of date, idCode, and treatment, there is no easy way to get such list. We must fall back to a linear search again, examining the metadata of all the waves in the folder in turn. A quick and dirty workaround would be to hash each type of metadata separately and have the wavename be a concatenation of those hashes, which would allow use of the WaveList function after hashing just the metadata we were interested in listing for WaveList("*" + hash("75") + "*", ";", ""). But this might lead to wavenames longer than 32 characters pretty quickly. It looks like a simple hash is not flexible enough for many common uses. Does anybody have any ideas about applying more database-like capabilities to the problem? Sets and Intersections and that kind of stuff?

741

Hi Jamie,

there's nothing preventing you from creating a relational database in Igor. But I think this is stretching Igor's limits a bit too much, and that you'll end up with lots of work in exchange for poor to mediocre performance.

If I were you I'd start by looking into leaving the relational database functionality to a ... relational database. Have you considered using SQL.xop in combination with something like SQLite?

Log in or register to post comments

March 1, 2012 at 04:12 am - Permalink

aclight

I second 741's suggestion of using the SQL XOP to do this kind of thing. But if you are not familiar with databases and/or the SQL XOP, the learning curve could be steep. And you would need to get your information into the database in the first place.

Several years ago I wrote a package that I used for collecting patch clamp data from cells. Originally, for each trace I recorded, I saved stimulus and recording parameters as KEY:value; pairs in the wave note. The problem with doing this is that a lot of the bytes in the wave note (the KEY strings) are redundant, and my files were getting too large. So I later rewrote the program to create a 2D wave in which each column was a parameter I wanted to save and each row corresponded to a single trace. I then wrote a query engine that used the Extract operation with the data in this 2D wave (and actually several other similar 2D waves that contained other information). This worked well and was plenty fast for my needs but my queries couldn't get too complicated or they would exceed the maximum length of a command line or line in a procedure file (400 characters).

An alternative would be to scrap your hash table completely and go back to just searching through each wave's wave note--but with a twist.

Here is an example function:


Function test(numWavesToGenerate)
	Variable numWavesToGenerate
	
	// Generate a bunch of waves in a
	// new data folder.
	NewDataFolder/O/S test
	WAVE/WAVE generatedWaves = generateWaves(numWavesToGenerate)
	
	// Start timing.
	Variable start = StopMsTimer(-2)
	
	// Instead of searching through a wave note,
	// we'll just calculate the total number of points
	// in all waves in the current data folder.
	Variable totalNumPoints = 0
	
	//////////////////////////////
	// Slow way
	//////////////////////////////
	// Iterate through all the waves in the current
	// data folder and get the number of points in the
	// wave.
	Variable numWaves = CountObjects("", 1);
	Variable n
	
	
	String thisWaveName
	// Start timing the first loop.
	Variable firstLoopStart = StopMsTimer(-2)
	For (n=0; n < numWaves; n+=1)
		thisWaveName = GetIndexedObjName("", 1, n)
		if (strlen(thisWaveName) == 0)
			break
		endif
		
		WAVE thisWave = $(thisWaveName)
		totalNumPoints += numpnts(thisWave)
	EndFor
	
	// Stop timing the first loop.
	Variable firstLoopStop = StopMsTimer(-2)
	print "Total Num Points of first loop:", totalNumPoints
	
	//////////////////////////////////
	// Fast way
	//////////////////////////////////
	// Iterate through all waves in generatedWaves.
	// Since this wave is a wave of wave references,
	// we don't have to look up each wave in the current
	// data folder by name, which gets slower the more
	// waves a data folder contains.
	totalNumPoints = 0
	// Start timing the second loop.
	Variable secondLoopStart = StopMsTimer(-2)
	For (n=0; n < numWaves; n+=1)
		WAVE thisWave = generatedWaves[n]
		totalNumPoints += numpnts(thisWave)
	EndFor
	
	// Stop timing the second loop.
	Variable secondLoopStop = StopMsTimer(-2)
	
	// Stop timing.
	Variable stop = StopMsTimer(-2)
	
	print "Total Num points of second loop:", totalNumPoints
	
	// Go back to the root data folder
	SetDataFolder root:
	
	// Kill the data folder containing the waves.
	KillDataFolder root:test
	
	// Print timing results.
	Variable total = (stop - start)/1000
	Variable first = (firstLoopStop - firstLoopStart)/1000
	Variable second = (secondLoopStop - secondLoopStart)/1000
	printf "%d waves (ms):: Overall: %g\t1st loop: %g\t2nd loop: %g\r", numWavesToGenerate, total, first, second
End


// Generates numToGenerate waves in the
// current data folder.
Function/WAVE generateWaves(numToGenerate)
	Variable numToGenerate
	Variable n
	String thisWaveName

	// Create a free wave to hold wave references to the
	// generated waves.
	Make/O/N=(numToGenerate)/WAVE/FREE waveOfWaves

	For (n=0; n < numToGenerate; n+=1)
		sprintf thisWaveName, "wave%.8d", n
		Make/O $thisWaveName /WAVE=thisWave
		waveOfWaves[n] = thisWave
	EndFor
	return waveOfWaves
End

The first loop iterates through all of the waves in the current data folder using GetIndexedObjName(). This is relatively slow because waves in a data folder are stored in a linked list. This is not a problem when there are not very many waves in a data folder, but when you have thousands of waves in the data folder, the time to iterate through all waves rapidly increases.

The second method uses the wave of wave references returned by the generateWaves() function and iterates through that wave. In this case the time to iterate through the waves is roughly linear, and it is much faster even for fewer waves because there is no need to lookup the waves from the linked list.

On a development build of Igor 6, here are the timings I get:


•test(5000)
  Total Num Points of first loop:  640000
  Total Num points of second loop:  640000
  5000 waves (ms):: Overall: 681.901	1st loop: 679.458	2nd loop: 1.19863
•test(10000)
  Total Num Points of first loop:  1.28e+06
  Total Num points of second loop:  1.28e+06
  10000 waves (ms):: Overall: 2719	1st loop: 2715.07	2nd loop: 2.54419
•test(20000)
  Total Num Points of first loop:  2.56e+06
  Total Num points of second loop:  2.56e+06
  20000 waves (ms):: Overall: 20047	1st loop: 20037.8	2nd loop: 5.97675

Note that in the real world, you would need to generate a wave of WAVE references and store that wave somewhere. Generating that wave the first time could be slow if there are lots of waves, but you only need to generate it once (unless you are adding/deleting waves in the data folder of interest).

Depending on how you are doing the search, it might even be possible to do the search using multiple threads, which could speed things up even further.

Log in or register to post comments

March 1, 2012 at 09:14 am - Permalink

jamie

Dr. Jamie Boyd, Ph.D.
UBC In Vivo Imaging Core

Thanks for the ideas.

Adam, it was a bit of a surprise to me to see that GetIndexedObjName(sourceFolderStr, objectType, ii) runs in linear time, traversing a linked list. I wrote functions that trawl through all the waves in a folder with getIndexedObjectName, doing matching based on wave names, or wave notes, and indeed, they are big O of n^2.

There must be some way to maintain a sorted structure of pointers to waves so you could get the nth wave without having to traverse a whole list. That's where you idea of maintaining your own database of wavereferences is interesting. You could have several lists of references to the same wave, sorted by different metadata. That is, pair each waveReference wave with a wave of corresponding metadata, and sort both waves by the metadata wave. You could find all the data waves corresponding to a range of values for some metadata quite quickly then, by binary search on the metadata wave. But as you point out, you would have to make and maintain the list of wave references.

By contrast, the WaveList(matchStr, separatorStr, optionsStr ) function runs in linear time; it can do the check for the wavename matching on each wave as it traverses the list. I use a function (below) that duplicates (and extends to other Igor objects) the WaveList function. Another advantage is that you don't have to change datafolders to use GetIndexedObjName as with WaveList. I didn't realize what a performance hit that was (n vs n^2).


function wavelistTest (nT)
	variable nT
	
	variable ii, iii
	variable timer, eTIme, nELs
	newdatafolder/o/s root:wlt
	killwaves/a/z
	make/o/n= (nT/10) rTWave, rElWave, rTwave1, rElwave1
	rTWave=0
	relwave =0
	display rTWave vs rElWave
	appendtograph/r rTwave1 vs rElwave1
	modifygraph mode=2, lsize=2
	modifygraph rgb (rTwave1) = (0,0,65535)
	doupdate
	for (ii=0; ii < nT; ii+=1)
		for (iii=0; iii < 3; iii += 1)
			make/o/n = (100) $"W_xxx" + num2str (3*ii+iii)
			make/o/n = (100) $"W_xxx" + num2str (3*ii+iii)
			make/o/n = (100) $"W_yy" + num2str (3*ii+iii)
		endfor
		if (mod (ii, 10) ==0)
		timer = startmstimer
		nELs= itemsinlist (wavelist ("*yy*", ";", ""), ";")
		eTIme = stopMSTimer(timer )
		rTWave [ii/10] = eTIme
		rElWave [ii/10] = nEls
		timer = startmstimer
		nELs= itemsinlist (ListObjects("root:wlt", 1, "*yy*", 0, ""), ";") // uses getIndexedObject in a loop
		eTIme = stopMSTimer(timer )
		rTWave1 [ii/10] = eTIme
		rElWave1 [ii/10] = nEls
		endif
		if (mod (ii, 100) ==0)
		doupdate
		endif
	endfor
end

//******************************************************************************************************
Function/s ListObjects(sourceFolderStr, objectType, matchStr, givepath, erstring)
	string sourceFolderStr // can be either ":" or "" to specify the current data folder. You can also use a full or partial data folder path.
	variable objectType / 1 = waves, 2= numeric variables, 3= string variables, 4 = data folders
	string matchstr // limit list of objects with this wildcard-enhanced string. use "*"  to get all objects	
	variable givepath // use 1 to get full or relative filepaths depending on sourceFolderStr. use 0 to get just the objects with no path.
	string erstring // a string to return if no objects are found
	
	string objName =""  // Will contain all the objects in the folder, one by one as we iterate through the index
	string liststr = ""   // will contain the list of objects found and matched 
	variable numObjs //  the number of objects in the folder under consideration
	variable ii  // used to iterate through the list of objects

	if (!(datafolderexists (sourceFolderStr)))
		return "\\M1(" + sourceFolderStr + " does not exist." // why the "\\M1(" ? It is a formatting code for disabling choices in pop-up menus
	else
		numObjs =  CountObjects(sourceFolderStr, objectType)
		for (ii = 0; ii < numObjs; ii += 1)
			objName = GetIndexedObjName(sourceFolderStr, objectType, ii)
			if (!(stringmatch(objname, matchStr )))
				continue
			endif
			if (givepath & 1)
				if ((cmpstr (sourceFolderStr [strlen(sourceFolderStr)-1], ":")) == 0)
					objName = sourceFolderstr + objname
				else
					objName = sourceFolderstr + ":" + objname
				endif
			endif
			liststr += objname + SelectString(((objectType == 4) && (!(givePath &2))), ";", ":;")
		endfor
		if ((strlen (liststr)) < 2)
			liststr = erstring
		endif
		return liststr
	endif
end

Thinking about the idea of making a wave of wavereferences for indexing: Another simpler to set up and maintain solution might be to have a dictionary containing all the metadata keys mapped to individual letters, and, for each key, another dictionary of all the values for that key mapped to an integer. This mapping would provide short letter/number codes that could be used in wavenames. So the keys would be A,B,C, etc. and the values would be 1,2,3 etc, Wave named would look like A12B31C6. With 32 character length, you could have 7 keys, each of which could contain a maximum of 2^4 values, and still have 2 characters left over. Selecting all the waves that contained a particular value of a particular key would be a matter of finding the corresponding letter code for the key, the integer code for the value, and using the wavelist function, which at least gives us linear time, (and does it in compiled code, not Igor script).

741, the problem with using a database is it requires getting the data into a database, serving the database, and connecting to it. Also, a lot of our data are long waveform traces, or multiframe images. Not the stuff you want to be passing back and forth over a network. That being said, a lot of the analysis is done after some processing, extracting relative fluorescence changes from ROIS, eg. It may make sense to put those results into a database, and then use queries to easily look at results across multiple experiments.

While the topic of databases and metadata-rich scientific data, especially images is at hand, as anyone looked at aOMERO? http://www.openmicroscopy.org/site

OMERO is client-server software for visualisation, management and analysis of biological microscope images.

Looks kind of cool, and it has open source libraries that it would be possible to call from an Igor XOP. There is already Matlab access for it.

jamie

Log in or register to post comments

March 2, 2012 at 02:39 pm - Permalink