Brief performance review of key-value store methods in Igor Pro
Fri, 12/04/2020 - 12:46 pm
Braun T., Huth M.
Choosing suitable data structures to store data in a program is important for performance and maintainability. One common approach is to store data on a key-value basis: as in a dictionary, a key identifies each data record 1).
This article presents a performance review of several approaches to storing data on a key-value basis in Igor Pro. The first section introduces the methods that were benchmarked. Then the benchmark procedure is described and the results are discussed. The experiment file with the benchmark routine can be downloaded at the end.
Key-Value Store Methods
The simplest approach is to use a wave whose name is the key. For each key a single-element wave is created that holds the stored data:
Make/T/N=1 dfr:$key/WAVE=wv
wv = dataStr
WAVE/T/SDFR=dfr wv = $key
dataStr = wv[0]
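The two snippets above can be wrapped into small helper functions. A minimal sketch (the function names are ours, and the key is assumed to be a valid Igor Pro object name):

```igor
// Hypothetical helpers for the wave-per-key method.
Function StoreByWave(DFREF dfr, string key, string dataStr)
	Make/O/T/N=1 dfr:$key/WAVE=wv
	wv[0] = dataStr
End

Function/S ReadByWave(DFREF dfr, string key)
	WAVE/T/SDFR=dfr wv = $key
	return wv[0]
End
```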
For storing key-value pairs in strings, Igor Pro has the built-in function StringByKey:
str += key + ":" + dataStr + ";"
dataStr = StringByKey(key, str)
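Note that plain concatenation appends a second pair if the key already exists in the list. For updates, Igor Pro's built-in ReplaceStringByKey adds a pair or replaces an existing one (a side note, not part of the benchmark):

```igor
str = ReplaceStringByKey(key, str, dataStr) // adds the pair or replaces an existing one
```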
Another method uses a wave with the keys set as dimension labels. The data can then be retrieved from the wave with the %dimLabel notation.
SetDimLabel 0, index, $key, textWave
textWave[index] = dataStr
dataStr = textWave[%$key]
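Wrapped into helper functions, the dimension label method could look like this (a sketch; the function names are ours, and the wave is assumed to be large enough for the given index):

```igor
// Hypothetical helpers for the dimension label method.
Function StoreByDimLabel(WAVE/T textWave, variable index, string key, string dataStr)
	SetDimLabel 0, index, $key, textWave
	textWave[index] = dataStr
End

Function/S ReadByDimLabel(WAVE/T textWave, string key)
	return textWave[%$key]
End
```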
Similarly, a text wave can be used to store the keys, with the index position serving as the value. The actual data could then be indexed from another wave.
textWave[index] = key
FindValue/TEXT=(key)/TXOP=4/UOFV textWave
index = V_Value
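A small helper for growing the key wave could look like the following sketch (the helper name is ours):

```igor
// Appends key to textWave and returns its index position.
Function AddKey(WAVE/T textWave, string key)
	variable index = DimSize(textWave, 0)

	Redimension/N=(index + 1) textWave
	textWave[index] = key
	return index
End
```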
The free JSON XOP allows storing the data in a JSON document, where each key-value pair is addressed by a path:
JSON_SetString(jsonID, "/" + key, dataStr)
dataStr = JSON_GetString(jsonID, "/" + key)
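A complete write/read with the JSON XOP also needs to create and release the JSON document. A minimal sketch, assuming the XOP is installed:

```igor
Function DemoJSONStore()
	variable jsonID = JSON_New()        // create an empty JSON document
	string   dataStr

	JSON_SetString(jsonID, "/myKey", "some data")
	dataStr = JSON_GetString(jsonID, "/myKey")
	print dataStr
	JSON_Release(jsonID)                // free the document again
End
```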
It should also be noted that Igor Pro includes an SQL XOP for accessing SQL databases, which in principle can store data as key-value pairs as well. It is not covered in this performance review since it requires not only activating the XOP but also a fully set up SQL server to connect to.
For each method a key-value dataset is created by writing a number of key-value pairs to the data structure the method uses. For benchmarking data retrieval, the values are read back using the keys in random order. To get good statistics on the execution times, a full dataset write/read cycle is repeated 10,000 times for dataset sizes up to 316 elements, 1,000 times for sizes up to 10,000 elements, and 100 times for sizes up to 1,000,000 elements.
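The timing of one method can be outlined with Igor Pro's microsecond timer (a simplified sketch, not the exact benchmark code; the function name is ours):

```igor
// Returns the average execution time per write/read cycle in microseconds.
Function/D TimeCycles(variable numRepetitions)
	variable i, timer

	timer = StartMSTimer
	for(i = 0; i < numRepetitions; i += 1)
		// write the full dataset here, then read all values back
	endfor

	return StopMSTimer(timer) / numRepetitions
End
```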
The first graph shows the execution time per write/read cycle versus the dataset size, up to 1,000,000 elements. The execution time is plotted on a logarithmic scale. The curves indicate that the following operations show an exponential performance decrease:
- Reading by Dimension Label
- Writing by Dimension Label
- Reading by Text Wave
- Reading by StringByKey
The onset where these methods get considerably slower than the other methods is at around 550 (~10^2.75) elements.
The only method that shows linear performance for both reading and writing is the JSON method. For dataset sizes above 6,300 elements, reading from a wave (blue circles) also shows an exponential performance decrease, with a small slope.
The second graph shows the summed execution times of reading and writing, which better reflects a generic use case. Up to a size of ~300 elements the Wave, StringByKey, Dimension Label and TextWave methods are the fastest. For dataset sizes up to 10,000 key-value pairs the fastest method is storing each value in its own wave. However, working with tens of thousands of waves can create other problems, such as a performance penalty when working with other data in the same data folder. Also, if the key is not a valid Igor Pro object name, additional code, e.g. for calculating a hash value, has to be executed. The free JSON XOP is then a viable alternative that matches the performance of resolving waves by key for larger datasets. The exponential increase of the wave method's curve indicates that its performance will degrade strongly for even larger datasets.
Here is the experiment with the benchmark routine: Experiment File
The results were obtained on a system with AMD Threadripper 1950X 3.4 GHz, 64 GB DDR4-2400 RAM, Windows 10 Pro, Igor Pro 8 Build 36002.
Note on error bars: In a simple approximation the errors scale with sqrt(times_run) if the timing distribution can be assumed to be Gaussian. The actual timing distribution differs for obvious reasons, e.g. because there is a lower limit on the execution time. Considering that the measuring routine is regularly interrupted by the operating system, only the results from a dataset size of 10 and larger have reasonably small confidence intervals. Some data points that show a characteristic increase at a specific dataset size proved reproducible on consecutive benchmark runs and could be related to properties of the benchmark system, e.g. caches. For larger dataset sizes the confidence interval is sufficiently small to have no influence on the overall curve characteristics.