Brief performance review of key-value store methods in Igor Pro

Interesting experiment.

FWIW, there is no point in using /UOFV with FindValue as that flag is ignored when searching a text wave.

Using individual waves as the key:value store, while surprisingly performant, is a pretty bad idea memory-wise. A wave's header (in 64-bit Igor) is 648 bytes, so each wave will require 648 bytes + the size of the data. That gets to be a lot, particularly if you're in the range of 10,000 key:value pairs.
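To put a number on that, here is a small back-of-the-envelope calculation. It is written in Python rather than Igor just to show the arithmetic; the 648-byte header size is the figure quoted above, while the function names and the 8-byte payload (one double per value) are assumptions made for illustration.

```python
WAVE_HEADER_BYTES = 648  # per-wave header size in 64-bit Igor, as quoted above


def header_overhead_bytes(n_pairs):
    """Memory spent on wave headers alone when each pair gets its own wave."""
    return n_pairs * WAVE_HEADER_BYTES


def per_wave_store_bytes(n_pairs, value_bytes):
    """Approximate total: one header plus the payload for every pair."""
    return n_pairs * (WAVE_HEADER_BYTES + value_bytes)


if __name__ == "__main__":
    n = 10_000
    print(header_overhead_bytes(n))    # 6480000 bytes, roughly 6.5 MB of headers
    print(per_wave_store_bytes(n, 8))  # 6560000 bytes, assuming one 8-byte value each
```

With an 8-byte payload, nearly 99% of the memory goes to headers rather than data.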

Is the JSON XOP using a hash table for its key:value store? You can write relatively simple code in Igor that uses StringCRC to construct a hash table. Using FindValue to find the CRC of your target string is substantially faster than using it to find some specific text.

December 7, 2020 at 05:21 pm

thomas_braun

> FWIW, there is no point in using /UOFV with FindValue as that flag is ignored when searching a text wave.

Thanks, looks like I missed that.

> Is the JSON XOP using a hash table for its key:value store? You can write relatively simple code in Igor that uses StringCRC to construct a hash table. Using FindValue to find the CRC of your target string is substantially faster than using it to find some specific text.

The underlying implementation in the JSON XOP is a std::map. I'd like to move that to std::unordered_map, but that is not so straightforward.

The idea with StringCRC is a good one. I'd love to have a pure Igor Pro implementation of a hash map with amortized constant-time lookup.

December 8, 2020 at 08:09 am

aclight

Here is a modified version of your experiment. I've broken out the different test methods so they are controlled by #defines and dropped the number of trials so that the tests run much faster.

I've also added a new method that is implemented by creating a crc32-based hash table.

If you could assume that two different keys would never have the same crc32, then this approach gives constant write time and nearly constant read time (with a significant step at about 3400 elements in my tests). In the attached experiment, this result was obtained with the FindValue command in FindPointIndexOfString that uses /UOFV.

But realistically, you should not assume that two different keys will never give the same crc32 (unless you are actually checking that at insert time, for example). So you should use the FindValue command with the /S flag in the attached experiment. This gives these results:

For a large number of elements, using the hash table for lookup is substantially faster, though the performance is roughly linear, not constant.
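For anyone who wants to see the collision-safe lookup spelled out, here is a rough sketch in Python (not Igor code, and not the code from the attached experiment). Three parallel lists stand in for the CRC, key, and value waves; `zlib.crc32` plays the role of StringCRC, and `list.index` with a start position plays the role of a FindValue search that resumes after a false match. The class and method names are made up for illustration.

```python
import zlib


def _crc32_of(key):
    """Default hash: CRC-32 of the UTF-8 bytes of the key."""
    return zlib.crc32(key.encode("utf-8"))


class CrcIndexedStore:
    """Key:value store that searches a list of CRCs, then verifies the key."""

    def __init__(self, hash_fn=_crc32_of):
        self.hash_fn = hash_fn
        self.crcs = []    # stand-in for a numeric wave of CRCs
        self.keys = []    # stand-in for a text wave of keys
        self.values = []  # stand-in for the values wave

    def _find(self, key):
        """Return the row of `key`, or -1 if it is not stored."""
        target = self.hash_fn(key)
        start = 0
        while True:
            try:
                # Linear search for the CRC, resuming after earlier false matches.
                idx = self.crcs.index(target, start)
            except ValueError:
                return -1
            if self.keys[idx] == key:  # compare the real key, not just the CRC
                return idx
            start = idx + 1

    def set(self, key, value):
        idx = self._find(key)
        if idx >= 0:
            self.values[idx] = value
        else:
            self.crcs.append(self.hash_fn(key))
            self.keys.append(key)
            self.values.append(value)

    def get(self, key, default=None):
        idx = self._find(key)
        return self.values[idx] if idx >= 0 else default
```

The search over the CRC list is still a linear scan, which is consistent with the roughly linear lookup time noted above; the gain comes from comparing integers instead of strings.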

One could conceivably make this a lot more complicated (but with closer to constant complexity) by using a better hash, e.g. a 4-column unsigned int32 wave to store the 128 bits of an MD5 hash (Igor's Hash function with mode 3). You would probably be safe assuming that no two inputs give the same MD5 hash. Yes, it *could* happen, but it's pretty unlikely.
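As a rough illustration of the storage layout, here is how a 128-bit MD5 digest splits into four unsigned 32-bit words, one per column. This is Python, not Igor; the function name and the little-endian byte order are arbitrary choices made for the sketch.

```python
import hashlib
import struct


def md5_as_uint32x4(key):
    """Return the MD5 digest of `key` as a tuple of four unsigned 32-bit ints."""
    digest = hashlib.md5(key.encode("utf-8")).digest()  # always 16 bytes
    # "<4I" = four little-endian unsigned 32-bit integers (byte order is a choice).
    return struct.unpack("<4I", digest)


if __name__ == "__main__":
    words = md5_as_uint32x4("some key")
    print(len(words))  # 4
```

Two keys match only if all four words are equal, so a lookup would have to compare a whole row rather than a single value.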

Attachments keyvalue-store-benchmark_1.pxp (51.95 KB)

December 8, 2020 at 09:55 am

thomas_braun

Thanks for your input, Adam. You don't need a better hash function. You just need a better way of distributing your data into the buckets, and then you compare the keys and not the hashes. I'm still struggling to get it always faster than the JSON XOP.
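The bucket idea, sketched in Python for brevity (this is not the Igor code from the attached experiment, and the names are made up): the hash only picks a bucket, and inside the bucket the keys themselves are compared, so CRC collisions are harmless. The bucket count of 1024 is an arbitrary assumption.

```python
import zlib


class BucketHashMap:
    """Hash map with a fixed number of buckets; the hash only selects a bucket."""

    def __init__(self, n_buckets=1024):
        if n_buckets < 1:
            raise ValueError("n_buckets must be at least 1")
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # CRC-32 of the key, reduced modulo the bucket count.
        crc = zlib.crc32(key.encode("utf-8"))
        return self.buckets[crc % len(self.buckets)]

    def add(self, key, value):
        bucket = self._bucket(key)
        for i, (stored_key, _) in enumerate(bucket):
            if stored_key == key:  # compare keys, not hashes
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for stored_key, value in self._bucket(key):
            if stored_key == key:
                return value
        return default
```

As long as the buckets stay short, each lookup touches only a handful of keys, which is where the near-constant lookup time comes from. A real implementation would also need to grow the bucket count as the map fills up.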

Attachments graph1_7.png (60.44 KB) keyvalue-store-benchmark_with_hashmap.pxp (511.17 KB)

December 9, 2020 at 03:27 pm

aclight

A bit of late night musing:

1. I see minor improvements if I use flags=2 with CmpStr (binary comparison). If you are willing to allow your keys to be case sensitive, then you save a bit of time by skipping the internal conversion of the parameters to the same case.

2. In HM_AddEntry and HM_GetEntry, I think you can do a shortcut. If nextFreeRow == 1, you don't need to do the CmpStr. I didn't test if that has any measurable impact on performance, but I suspect it will.

3. The Make and Redimension calls in HM_AddEntry are likely the cause of the plateau in write time from about 10e2 to 10e5 elements. If you expect the # of elements to be > 10e2 you might as well preallocate the keys and values waves ahead of time, before you start the timer for testing write performance. Yeah, that's sort of cheating, but already you're not counting the execution time of HM_Create. You might as well cheat as much as possible :)

4. In HM_Create, I'm surprised you didn't MultiThread the wave assignment, although it's possible that this would decrease performance because memory allocation in a thread needs to use mutexes when the OS allocates memory for the wave.

December 9, 2020 at 09:30 pm
