How can we efficiently import data with varying numbers of columns?

I am trying to load comma-separated data into waves using LoadWave. However, some rows of the data have a different number of columns than others (see attached sample_data.txt file as an example). A full file of data would have about 50000 lines, and I'm trying to automate the loading and parsing of this data by writing an Igor function. I notice in the documentation for LoadWave, all the examples for comma-delimited data use the same number of columns for each row, so I haven't seen an example describing how to handle this case.

So far, I haven't been able to get LoadWave to load all data from rows 3 and 8 in sample_data.txt. I think this is because LoadWave decides the number of columns based on the first line only, and I'm trying to create a flexible function that can handle data even if any given row is longer than the first row.

I can think of one solution for a reusable function, but it seems inefficient:

  • First, only load the first two columns of my data into waves so that I have the timestamp and the instrument ID for every row.
  • Create empty waves to host the other (still unloaded) data
  • Iterate through every row of data
    • Read the instrument ID
    • Based on that instrument ID, I know how many columns of data to expect and how to interpret them (A columnInfoString for LoadWaves /B tag could be created)
    • Use Redimension to expand the length of the appropriate destination waves by 1
    • Use LoadWave to load only that line into the appropriate destination waves

I think this would work, but it requires using Redimension and LoadWave on every single row of data in files of over 50000 rows. This seems clumsy and potentially slow upon execution.

Is there an easier solution that I've missed or I haven't thought of? Thanks in advance for any input!

sample_data.txt

If the lengths of each line are as small as in the example I would suggest you load them as a single column text wave and treat each line of the text wave as a comma separated string list. Something like LoadWave/O/N=LoadedTextWave /K=2 /L={0, 0, 0, 0, 1} /Q /J FullFileName should probably do it. Also have a look at StringFromList

The only issue would be if the comma separated strings become very long, because that would make the StringFromList calls very slow

Hi,

I really appreciate your suggestions.

When loading my data with your suggested LoadWave tags, only the first column (the datetime information) is loaded into the text wave, and the rest of the data is not loaded in. I believe this is because we specified a single column with L={0,0,0,0,1} and we specified that the data was delimited text with /J.

I tried removing the /J tag but this causes Igor to be confused about the file type - I get the following error: "Function Execution Error: While executing LoadWave, the following error occurred: can't handle this type of file".

It's just strange to me that Igor wouldn't be able to load a file like this. I'll keep trying to troubleshoot and let you know if I find a solution.

Thanks again!

The following worked for me.  Note the the file name is not shown here.

LoadWave/J/D/K=0/V={"\t"," $",0,0}

For a test wave where data rows 2 & 3 consisted of 3 elements and the others of 4, the wave looked like this:

print wave2
  wave2[0]= {"Reading,Load,Travel,Time","1,0.000,0.000,0.000","2,0.000,0.000","3,0.000,0.000","4,0.000,0.000,0.098",""}

This kind of data does not fit neatly into Igor's wave concept which is rigid compared to a spreadsheet which is free form.

The only option for text data that includes non-numeric data (date/time, text such as "CH3I_Oven") is LoadWave/J (Load Delimited Text).

Let us call the number of columns to be loaded "n".

In "auto" mode (you omit /L or specify 0 for the number of columns to load) LoadWave/J determines n from the first line of data. In "explicit mode" (you include /L and specify a non-zero value for the number of columns) you explicitly specify n using /L.

If a row has fewer than n columns, LoadWave treats the corresponding cell as blank (NaN). If a row has more than n columns, LoadWave ignores the extra columns.

Given that and the nature of your data, I would load the rows one-at-a-time using LoadWave/J and specifying one row using /L and letting n be auto. After loading each line, the variable V_Flag will be set to the number of waves load and you can determine how to handle that line using that value or from the instrument ID. You will have to put the data together in whatever format works for you.

 

Another way to do this would be to create a modified version of the file in which an additional first line is added. This first line would be guaranteed to have as many columns as the longest subsequent line. Then load the file without /L. The first line would control the number of columns loaded. Rows with fewer columns would result in blanks.

I created a file containing this data which starts with a 9-column dummy line that I added:

1904-01-01 00:00:00, Dummy, 1, 2, 3, 4, 5, 6, 7
2022-01-17 16:10:01, CH3I_Oven, 40.0, 31.0, 0.0, 0
2022-01-17 16:10:01, Catalyst, 350.0, 349.9, 65.0, 1
2022-01-17 16:10:02, HMP110, 5.143088, 30.871475, -11.390093, 1.635575, 1.411919, 12.576250, 34.792370
2022-01-17 16:10:02, Int_Std_Oven, 38.0, 38.0, 65.0, 1
2022-01-17 16:10:02, Acetate_Oven, 50.0, 49.9, 23.0, 1
2022-01-17 16:10:02, CH3I_Oven, 40.0, 31.0, 0.0, 0
2022-01-17 16:10:02, Catalyst, 350.0, 349.9, 65.0, 1
2022-01-17 16:10:02, HMP110, 5.143088, 30.871475, -11.390093, 1.635575, 1.411919, 12.576250, 34.792370
2022-01-17 16:10:02, Int_Std_Oven, 38.0, 38.0, 64.8, 1
2022-01-17 16:10:03, Acetate_Oven, 50.0, 49.9, 21.8, 1

I then successfully loaded it with this command:

LoadWave/J/D/A=wave/E=1/K=0

You would then have to delete the first row from each of the loaded waves either manually via the table or using a function like this:

Function DeleteFirstRow(listOfWaveNames)
    String listOfWaveNames  // Semicolon-separated list of wave names
   
    int numWaves = ItemsInList(listOfWaveNames)
    int i
    for(i=0; i<numWaves; i+=1)
        String name = StringFromList(i, listOfWaveNames)
        WAVE w = $name
        DeletePoints 0, 1, w
    endfor
End

The LoadWave operation sets the string variable S_waveNames to a semicolon-separates list of the names of the waves loaded so you can invoke DeleteFirstRow like this:

DeleteFirstRow(S_waveNames)

 

@jtigor, thanks for the suggestion, but I think your solution only works in the case where there are no rows of data with more columns than the very first row.

@hrodstein, thanks for your response! It's super clear and precise. I think I will try to implement your first suggestion. The second suggestion also seems smart, but since I'm trying to create a function that can be shared with coworkers I'd prefer it to be flexible and a "one-step" solution. Therefore, loading each row individually seems the way to go.

I appreciate everyone's input!