Reading and extracting different sections from a file in one go?

Hi there,

I have a routine that I use to read the contents from a file and place them into waves. The read and load function works fine, however, as of now, the process I use requires me to load the same file multiple times to extract the contents I want. For example, let's say I have some file called File A, and this file has 3 different sections that I would like to extract.

File A would look something like this:

stuff

...

beginning of section 1

...

end of section 1

stuff

beginning of section 2

...

end of section 2

stuff

beginning of section 3

...

end of section 3

stuff

The function I use calls FReadLine while looking for one targetString that identifies the beginning of the section and another targetString that identifies the end of the section. This read function is enclosed within another function that loads the data from that file into the appropriate waves in Igor, using the /B flag to define the spacing of the various columns. As a result, I end up calling this function with a different pair of target strings for each section I want, which has become somewhat annoying to use. What's a good strategy to accomplish this in one go? Any nudge in the appropriate direction would be appreciated. Thanks!

Here are examples of the functions I use.

This is the function that reads the document and looks for the target strings identifying the current section I want

Function FindGND_Energy(pathName, filePath, firstDataLine, numDataLines)
    String pathName     // Name of symbolic path or ""
    String filePath         // Name of file or partial path relative to symbolic path
    Variable &firstDataLine // Pass-by-reference output
    Variable &numDataLines  // Pass-by-reference output
 
    firstDataLine = -1
    numDataLines = -1
 
    Variable refNum
 
    Open/R/P=$pathName refNum as filePath
 
    String buffer, text
    Variable line = 0
 
    String targetString = " Total energy   (H) ="
    Variable targetStringLength = strlen(targetString)
 
    // Find first line
    do
        FReadLine refNum, buffer
        if (strlen(buffer) == 0)
            Close refNum
            return -1                       // The expected keyword was not found in the file
        endif
        text = buffer[0,targetStringLength-1]
        if (CmpStr(text,targetString) == 0)
            firstDataLine = line
            break                           // This is the first data line
        endif
        line += 1
    while(1)
 
    // Find last line
    targetString = " Nuc-nuc energy (H) ="
    targetStringLength = strlen(targetString)
    do
        FReadLine refNum, buffer
        if (strlen(buffer) == 0)
            // Ran out of lines - assume this is the last line of data
            line += 1
            break
        endif
        text = buffer[0,targetStringLength-1]
        if (CmpStr(text,targetString) != 0) // Line does not start with the target string
            // This is the line after the last data line
            break  
        endif
        line += 1
    while(1)
 
    numDataLines = line - firstDataLine + 1
 
    // Print firstDataLine, numDataLines        // For debugging only
 
    Close refNum
 
    return 0        // Success
End

This is the function used to load an individual file using the read function for a particular section of the document.

Function extractGND_En(pathName, filePath, extension,cAtom)
    String pathName     // Name of symbolic path or "" to display dialog.
    String filePath         // Name of file or "" to display dialog. Can also be full or partial path relative to symbolic path.
    String extension
    Variable cAtom
    Variable refNum
 
    // Possibly display Open File dialog.
    if ((strlen(pathName)==0) || (strlen(filePath)==0))
        Open /D /R /P=$pathName /T=(extension) refNum as filePath
        filePath = S_fileName           // S_fileName is set by Open/D
        if (strlen(filePath) == 0)      // User cancelled?
            return -1
        endif
        // filePath is now a full path to the file.
    endif
 
    Variable firstDataLine, numLines
    Variable GNDE = FindGND_Energy(pathName, filePath, firstDataLine, numLines)
    if (GNDE != 0)
        Printf "No data found in file %s\r", filePath
        return -1
    endif
   
    String DFTData = RemoveEnding(filePath,"gnd.out")
   
    String columnInfoStr = ""       // Prepare parameter for /B flag
   
    columnInfoStr += "N='_skip_',W=1;"
    columnInfoStr += "N='_skip_',W=20;"
    columnInfoStr += "N='_skip_',W=3;"
    columnInfoStr += "N='GND_En_',W=16;"
    columnInfoStr += "N='_skip_',W=40;"
 
    LoadWave /F={6, 11, 0} /B=columnInfoStr /D /O /K=0 /L={0,firstDataLine,numLines,0,0} /A /P=$pathName filePath
   
    Wave GND_En_
    String currentGNDname = "GND_En_" + num2str(cAtom)
    Duplicate/O GND_En_,$currentGNDname
    KillWaves/Z GND_En_
    SetDataFolder root:Packages:DFTClustering
    return 0
End

This is the function that takes the above functions and uses them to load multiple files:

Function LoadAllGND_En(pathName,filePath,nFiles)    //Loads all .out files from StoBe containing info necessary to build tensor
   
    String pathName                 //Name of symbolic path of folder containing files to be loaded. You can also type "" to create a new symbolic path.
    String filePath                 //Name of file to be loaded as defined in LoadDFTData procedure
    Variable nFiles
    Variable i  = 0                 //Used to index files.
    Variable result
   
    if (strlen(pathName)==0)
        NewPath/O tempPath
        if (V_flag !=0)
            return -1
        endif
        pathName = "tempPath"
    endif
   
    i=0

    for(i=0;i<nFiles;i+=1) //Loops through every file in folder with a .out extension
        filePath = SortList(IndexedFile($pathName,-1,".out"),";",16)
        String currentFile = StringFromList(i,filePath)
        currentFile =  RemoveEnding(currentFile,";")

        if (strlen(currentFile) == 0)   //if there are no more files
            break                   //end loop
        endif
        result = extractGND_En(pathName, currentFile, ".out",i+1)
    endfor                  //The loop ends after there are no more files with a .out extension present
   
    PathInfo tempPath               // Exists() does not report symbolic paths; PathInfo sets V_flag to 1 if the path exists
    if (V_flag)
        KillPath tempPath
    endif

    print filePath
   
    String GNDList = WaveList("GND_En*",";","")
    Make/O/N=(nFiles) GND
    for(i=0;i<nFiles;i+=1)
        String cGNDname = StringFromList(i,GNDList)
        Wave w = $cGNDname
        GND[i] = w[0]
        KillWaves/Z w
    endfor
   
    return 0
End

 

I assume the sections are marked with the lines you are defining as targetString, right? If not, how would you discern the different sections of a file? A quick look at your functions gives me the impression that the easiest approach would be to build another loop around the line-searching part of FindGND_Energy(), so that instead of a single value of firstDataLine and numDataLines you build a list with one entry per section (if there is more than one). You could, for example, replace the variables with a wave that gets extended as more sections are found. A single-entry wave is then the previous case with only one section. Then, within extractGND_En(), loop through this list to load all sections successively. An alternative approach would be to merge both functions and refactor the code so that the section data is loaded as soon as both firstDataLine and numDataLines are set, and then repeat this process in a loop until all sections are processed.

That's correct. The different sections have different values for targetString and have their own read and load functions associated with it. Your suggestion makes sense. I'll see if I can implement it. Thanks!

Create a wave of file paths.

Make/O/T/N=(nfiles) filepaths

Use the first function to loop through all files in this wave. Store the target results in a matrix wave.

Make/O/N=(nfiles,2) lineInfo    // column 0: start line, column 1: number of lines

Use the second function with that wave as input to loop again through all files to get the values.

This modular approach is easier to debug in a step-wise manner. Also, you might consider passing the start and end strings as inputs to the first function rather than hard coding them in the function. This change will make it easier to re-use the code to parse through the file set for the locations of other start/end tags.
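The two-step approach above could look roughly like the following. This is only a sketch: the function name FindAllGNDLines and the wave name lineInfo are made up for illustration, pathName is assumed to name an existing symbolic path, and FindGND_Energy is the function posted in the question.

```igor
// Hypothetical driver: records where the section starts in every .out file.
Function FindAllGNDLines(pathName)
    String pathName     // Name of an existing symbolic path (assumption)

    // Same file listing and sort order as in LoadAllGND_En
    String fileList = SortList(IndexedFile($pathName, -1, ".out"), ";", 16)
    Variable nfiles = ItemsInList(fileList)

    // One row per file
    Make/O/T/N=(nfiles) filepaths = StringFromList(p, fileList)
    Make/O/N=(nfiles,2) lineInfo = NaN  // column 0: start line, column 1: number of lines

    Variable i, firstLine, numLines
    for(i=0; i<nfiles; i+=1)
        if (FindGND_Energy(pathName, filepaths[i], firstLine, numLines) == 0)
            lineInfo[i][0] = firstLine
            lineInfo[i][1] = numLines
        endif                           // Rows stay NaN when nothing was found
    endfor

    return nfiles
End
```

A second function can then loop over filepaths and lineInfo and call LoadWave for each row, which makes it possible to inspect lineInfo in a table before any data is loaded.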

The previous suggestions are both good ones.

First, your FindGND_Energy function loops through the file twice. There's no point in doing this, and until Igor 9 (in beta soon), FReadLine is a lot slower than it could be. I would refactor the function to loop through once and test each line for the start and end strings. Even better, if you can rely on the fact that the end string won't appear before the start string, and/or that the start string will only be present once, then you can start by testing each line for only the start string. Once you find it, you test each line for only the end string.
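A minimal sketch of that single-pass idea follows, with the start and end strings passed in as parameters. The name FindSection is hypothetical, and the sketch assumes the start string occurs before the end string. Note that it treats the end string as the last line of the section, which is not necessarily how the original FindGND_Energy counts lines, so check numLines against your files.

```igor
// Hypothetical helper: locates one section in a single pass through the file.
Function FindSection(pathName, filePath, startStr, endStr, firstLine, numLines)
    String pathName         // Name of symbolic path or ""
    String filePath         // Name of file or partial path relative to symbolic path
    String startStr         // Text at the start of the first line of the section
    String endStr           // Text at the start of the last line of the section
    Variable &firstLine     // Output: zero-based line number of the start line
    Variable &numLines      // Output: number of lines from start line through end line

    firstLine = -1
    numLines = -1

    Variable refNum
    Open/R/P=$pathName refNum as filePath

    String buffer
    Variable line = 0
    Variable startLen = strlen(startStr)
    Variable endLen = strlen(endStr)

    do
        FReadLine refNum, buffer
        if (strlen(buffer) == 0)
            break                               // End of file
        endif
        if (firstLine < 0)
            // Still looking for the start string; the end string is not tested yet
            if (CmpStr(buffer[0,startLen-1], startStr) == 0)
                firstLine = line
            endif
        elseif (CmpStr(buffer[0,endLen-1], endStr) == 0)
            numLines = line - firstLine + 1     // Include the end line itself
            break
        endif
        line += 1
    while(1)

    Close refNum

    if (firstLine < 0)
        return -1                               // Start string not found
    endif
    if (numLines < 0)
        numLines = line - firstLine             // No end string: section runs to end of file
    endif
    return 0                                    // Success
End
```

Because the targets are parameters, the same function can be called once per section, for example with " Total energy   (H) =" and " Nuc-nuc energy (H) =" for the first section.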

An alternative to using FReadLine in a loop would be to use Grep/Q/INDX. Grep will basically do the same thing you're doing now (run through the file a line at a time, attempting to match each line to a regular expression), but it will execute as C++ code instead of an Igor loop, which might be substantially faster. Note that since Grep uses the same code under the hood as FReadLine to read the file, it too is slower than it could be (until IP9).

If you use the Grep approach, you have two options. The first is to call Grep twice, once with an expression for the start string and once for the end string. This will result in you looping through the file twice, but coming up with the regular expression is simpler and your code might be more reliable. The other option would be to create a regular expression that matches either the start or the end string, and call Grep only one time. The W_Index wave would then have 2 entries (presumably), and you would treat the first as the start row number and the second as the end row number. If you can make assumptions about your files this may be the best approach.
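As a rough sketch of the single-call option, assuming the file contains exactly one start line and one end line, in that order: the function name FindSectionGrep is made up, pathName is assumed to name an existing symbolic path, and the behavior of V_flag, V_value and W_Index should be checked against the Grep documentation for your Igor version.

```igor
// Hypothetical helper: uses one Grep call to find the start and end lines.
Function FindSectionGrep(pathName, filePath, regex, firstLine, numLines)
    String pathName         // Name of an existing symbolic path (assumption)
    String filePath         // File name or partial path relative to the symbolic path
    String regex            // Regular expression matching the start line or the end line
    Variable &firstLine     // Output: zero-based line number of the first match
    Variable &numLines      // Output: number of lines from first match through second match

    firstLine = -1
    numLines = -1

    // /INDX stores the zero-based numbers of the matching lines in W_Index
    Grep/Q/INDX/P=$pathName/E=regex filePath
    if (V_flag != 0 || V_value < 2)
        return -1           // Error, or fewer than two matching lines
    endif

    Wave W_Index            // Created by Grep/INDX in the current data folder
    firstLine = W_Index[0]
    numLines = W_Index[1] - W_Index[0] + 1
    KillWaves/Z W_Index

    return 0
End
```

For the strings in the question, the expression could be something like "^ (Total|Nuc-nuc) energy\\s+\\(H\\) =". The parentheses around H have to be escaped because they are special characters in a regular expression, and each backslash is doubled because it sits inside an Igor string literal.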

I also recommend that you change extractGND_En to return either a wave reference to the output wave (the one created by this line: "Duplicate/O GND_En_,$currentGNDname") or a string containing the name of the wave created by that same line. The reason I recommend this is that the WaveList command you use later gives you a list of waves with undefined order. Currently the order is the order in which the waves were added to the data folder, which is probably what you want, but it is possible that future changes to your code (e.g. loading in multiple threads) could result in the waves being added to the data folder in a different order, which might throw off your downstream code. An alternative would be to call SortList on the output of WaveList to sort the wave names alphanumerically. That way your GNDList string always has a defined order.
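The SortList alternative is a one-line change. A small sketch, wrapped in a hypothetical helper named SortedGNDList so that it compiles on its own:

```igor
// Hypothetical helper: returns the GND_En_ wave names in a defined order.
Function/S SortedGNDList()
    // Option 16 requests a case-insensitive alphanumeric sort,
    // so GND_En_2 comes before GND_En_10.
    return SortList(WaveList("GND_En_*", ";", ""), ";", 16)
End
```

A plain alphabetic sort would put GND_En_10 ahead of GND_En_2, which is why the alphanumeric option matters once there are ten or more files.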

These are all great suggestions!

Making the code more modular is definitely one of the things I'm aiming for so that seems like a good strategy to accomplish that.

I can definitely rely on the start string only happening once and the end string always happening after the start string, so I could refactor the code to only loop through the file once. Thanks for pointing that out! Switching to Grep might be worthwhile since some of the files I'm reading can be quite large at times.

There is one more important thing that I failed to mention regarding the different sections and it's that they have different column structures. Therefore, I will need a different /B flag for the LoadWave command. This is one of the main reasons I ended up structuring my routine the way I did since I wasn't sure if there was a better way. Is there a workaround for this?

Thanks for all the help!

Here are the columnInfoString values I use for the different sections:

Section 1 columnInfoStr

columnInfoStr += "N='_skip_',W=1;"
columnInfoStr += "N='_skip_',W=20;"
columnInfoStr += "N='_skip_',W=3;"
columnInfoStr += "N='GND_En_',W=16;"
columnInfoStr += "N='_skip_',W=40;"

Section 2 columnInfoStr

columnInfoStr += "N='_skip_',W=2;"
columnInfoStr += "N='_skip_',W=4;"
columnInfoStr += "N='LUMO_En_',W=11;"
columnInfoStr += "N='_skip_',W=11;"
columnInfoStr += "N='_skip_',W=11;"
columnInfoStr += "N='_skip_',W=11;"
columnInfoStr += "N='_skip_',W=11;"
columnInfoStr += "N='_skip_',W=14;"
columnInfoStr += "N='_skip_',W=14;"
columnInfoStr += "N='_skip_',W=14;"

Section 3 columnInfoStr

columnInfoStr += "N='_skip_',W=1;"
columnInfoStr += "N='LUMO_POS_',W=7;"
columnInfoStr += "N='OCCU_',W=7;"
columnInfoStr += "N='_skip_',W=15;"
columnInfoStr += "N='_skip_',W=6;"
columnInfoStr += "N='_skip_',W=6;"
columnInfoStr += "N='_skip_',W=7;"
columnInfoStr += "N='_skip_',W=15;"
columnInfoStr += "N='_skip_',W=6;"
columnInfoStr += "N='_skip_',W=6;"

 

Your different columnInfoStr strings could be input parameters to the common routine that loads a section of the file.
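For example, something along these lines. The names LoadSection and Section1ColumnInfo are hypothetical, and the LoadWave flags are copied unchanged from extractGND_En, so the /F values in particular may need adjusting (or their own parameter) for sections with a different number of columns.

```igor
// Hypothetical common loader: the caller supplies the /B column description.
Function LoadSection(pathName, filePath, firstLine, numLines, columnInfoStr)
    String pathName         // Name of symbolic path
    String filePath         // Name of file or partial path relative to symbolic path
    Variable firstLine      // Zero-based number of the first line to load
    Variable numLines       // Number of lines to load
    String columnInfoStr    // Column names and widths for this section

    LoadWave /F={6, 11, 0} /B=columnInfoStr /D /O /K=0 /L={0,firstLine,numLines,0,0} /A /P=$pathName filePath

    return V_flag           // LoadWave sets V_flag to the number of waves loaded
End

// Hypothetical helper: returns the column description for section 1.
Function/S Section1ColumnInfo()
    String columnInfoStr = ""
    columnInfoStr += "N='_skip_',W=1;"
    columnInfoStr += "N='_skip_',W=20;"
    columnInfoStr += "N='_skip_',W=3;"
    columnInfoStr += "N='GND_En_',W=16;"
    columnInfoStr += "N='_skip_',W=40;"
    return columnInfoStr
End
```

With one such helper per section, the loading code becomes a call like LoadSection(pathName, filePath, firstLine, numLines, Section1ColumnInfo()), and the only things that change from section to section are the target strings and the column description.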

Thank you so much everyone for the suggestions! I got it working via a combination of the different approaches mentioned in the thread.