Importing Data from DFT Calculations

Hi there,

I'm trying to import data from DFT calculations I've been running. The problem is that the data I want is not always on the same line, and the files are massive. The start of the data can be identified by a few keywords (e.g. Rigid spectral shift, NEXAFS), as shown in this excerpt from the original file:

Xray absorption (XAS, NEXAFS) calculation
Core hole found (by occ.) in alpha space, orbital # 6
Core hole located at center C1

Orbital energy core hole = -10.69511 H ( -291.03103 eV)
Rigid spectral shift = 0.00000 eV
Ionization potential = 291.03103 eV

Core -> unocc. excitations (X-ray absorption), dipole only :

E (eV) OSCL oslx osly oslz osc(r2)
---------------------------------------------------------------------------------------
# 1 287.0055 0.007305 0.014191 0.028940 0.000089 0.0000 72.7191
# 2 287.9669 0.000221 0.002439 0.005033 0.000011 0.0000 126.2999
# 3 288.4817 0.004180 -0.010701 -0.021838 -0.000099 0.0000 147.8532
# 4 289.1079 0.000177 0.002519 0.004267 -0.000622 0.0000 128.5743

My current workaround is to copy and paste the portion of the data I want (everything in the first 6 columns below the dashed line) into a separate text file and import that into Igor. This is tedious, since the original files are extremely large (sometimes millions of lines), so I would like to automate the process if possible. Example below:

E (eV) OSCL oslx osly oslz osc(r2)
---------------------------------------------------------------------------------------
1 287.0055 0.007305 0.014191 0.028940 0.000089 0.0000 72.7191
2 287.9669 0.000221 0.002439 0.005033 0.000011 0.0000 126.2999
3 288.4817 0.004180 -0.010701 -0.021838 -0.000099 0.0000 147.8532
4 289.1079 0.000177 0.002519 0.004267 -0.000622 0.0000 128.5743

Out of those millions of lines, I only need about 1000 that contain my data. Here is the current import procedure I'm using for my workaround:
#pragma TextEncoding = "UTF-8"
#pragma rtGlobals=3     // Use modern global access method and strict wave access.

Function LoadDFT(DFTdata,pathName)

    String DFTdata  //Desired name of file
    String pathName //Symbolic path where desired file is present
    String DFTFolder=GetDataFolder(1)      
    String foldername= "root:"+RemoveEnding(DFTData,".out") //Names folder by taking the file name and removing the .out, necessary for proper file parsing
    String DFTData2=RemoveEnding(DFTData,".out")            //Names the files by taking the file name and removing the .out ending
    String columnInfoStr = " "                                  //Contains set of names for each column in the .out file
    columnInfoStr += "C=1,F=0,W=3,N='_skip_';"
    columnInfoStr += "C=1,F=0,W=11,N=EnergyH_"+DFTdata2+";"
    columnInfoStr += "C=1,F=0,W=11,N=OS_"+DFTdata2+";"
    columnInfoStr += "C=1,F=0,W=11,N=TDMx_"+DFTdata2+";"
    columnInfoStr += "C=1,F=0,W=11,N=TDMy_"+DFTdata2+";"
    columnInfoStr += "C=1,F=0,W=11,N=TDMz_"+DFTdata2+";"
    columnInfoStr += "C=1,F=0,W=18,N='_skip_';"
    columnInfoStr += "C=1,F=0,W=18,N='_skip_';"
   
    NewDataFolder/O/S $foldername                           //Makes a data folder named after the file and sets it as current
    print DFTData                                           //prints the loaded files
    LoadWave/J/B=columnInfoStr/D/W/E=0/K=0/V={"\t, "," $",1,1}/F={6,1,0}/N/O/Q/P=$pathName DFTData
End


I've attached a much smaller version of the DFT output that I'm trying to import. I'm using Igor 8 in case that helps. Any suggestions? Thanks for your help in advance!

Best,

Vic
example_1.txt
Do you know the exact line where loading should start? If yes, you can use the /L flag of LoadWave. If not, you can try to find it via Open/FReadLine etc. If FReadLine is too slow and the file fits in memory, you can also use FBinRead to read it in one shot.
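For the FBinRead route, a minimal sketch could look like this. The function name ReadWholeFile is made up for illustration, and the sketch assumes the file fits comfortably in memory; it has not been tried on your file.

```igor
// Hypothetical helper: reads an entire text file into a string.
// Returns "" if the file could not be opened.
Function/S ReadWholeFile(pathName, filePath)
    String pathName     // Name of symbolic path
    String filePath     // Name of file or partial path relative to symbolic path

    Variable refNum
    Open/R/Z/P=$pathName refNum as filePath
    if (V_flag != 0)
        return ""       // Could not open the file
    endif

    FStatus refNum                                  // Sets V_logEOF to the file size in bytes
    String contents = PadString("", V_logEOF, 0x20) // Preallocate a buffer of that size
    FBinRead refNum, contents                       // Read the whole file in one shot
    Close refNum

    return contents
End
```

You could then use strsearch to locate a keyword in the returned string. Note that strsearch gives a byte offset rather than a line number, so it would still need converting before it can be passed to LoadWave /L.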
So what you need to know is the start (and perhaps end) position of the data in the file.

You could use Open to open the file, FReadLine in a loop with a counter to find the start position by text comparison, Close to close the file, then construct a LoadWave command to skip the unneeded lines and load the data.

Alternatively, Grep can be used to find a text marker in the file and return the line through V_startParagraph.
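A minimal sketch of the Grep approach might look like this. The function name FindFirstDataLineWithGrep and the regular expression are assumptions for illustration: the pattern assumes the first data row begins with optional white space, "#", and the index 1, as in the excerpt above, and it only reports the first such row in the file.

```igor
// Hypothetical helper: returns the zero-based line number of the first
// data row, or -1 if nothing matched (or the file could not be read).
Function FindFirstDataLineWithGrep(pathName, filePath)
    String pathName     // Name of symbolic path
    String filePath     // Name of file or partial path relative to symbolic path

    // /Q suppresses output; /Z suppresses error reporting so V_flag can be tested.
    // Assumed pattern: optional white space, "#", white space, "1", white space.
    Grep/Q/Z/P=$pathName/E="^\\s*#\\s*1\\s" filePath
    if (V_flag != 0 || V_value == 0)
        return -1       // Error or no matching line
    endif

    return V_startParagraph     // Line number of the first match
End
```

The returned value could then be used as the first-line parameter of the LoadWave /L flag. The pattern will not match the "Core hole found ... orbital # 6" line, because there the "#" is not at the start of the line.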
Hmmm, so I tried to modify hrodstein's example for my data set, but I keep getting the message "No data found in file". For some reason, it's not finding the keyword I specify. I've tried using different keywords but none are being registered/found. I think the problem might be due to the buffer or the text strings. Any thoughts?

#pragma rtGlobals=3     // Use modern global access method and strict wave access.

Function FindFirstDataLine(pathName, filePath)
    String pathName     // Name of symbolic path or ""
    String filePath         // Name of file or partial path relative to symbolic path.
 
    Variable refNum
 
    Open/R/P=$pathName refNum as filePath
   
    String buffer, text
    Variable line = 0
 
    do
        FReadLine refNum, buffer
        if (strlen(buffer) == 0)
            Close refNum
            //print "Can't find keyword"
            return -1                       // The expected keyword was not found in the file
        endif
        text = buffer[0,1]
        if (CmpStr(text," # ") == 0)       
            Close refNum
            return line + 1                 // Success: The next line is the first data line.
            print "Success!"
        endif
        line += 1
    while(1)
 
    return -1       // We will never get here
End

Function LoadDataFile(pathName, filePath, extension)
    String pathName     // Name of symbolic path or "" to display dialog.
    String filePath         // Name of file or "" to display dialog. Can also be full or partial path relative to symbolic path.
    String extension            // e.g., ".dat" for .dat files. "????" for all files.
 
    Variable refNum
 
    // Possibly display Open File dialog.
    if ((strlen(pathName)==0) || (strlen(filePath)==0))
        Open /D /R /P=$pathName /T=(extension) refNum as filePath
        filePath = S_fileName           // S_fileName is set by Open/D
        if (strlen(filePath) == 0)      // User cancelled?
            return -1
        endif
        // filePath is now a full path to the file.
    endif
 
    Variable firstDataLine = FindFirstDataLine(pathName, filePath)
 
    if (firstDataLine < 0)
        Printf "No data found in file %s\r", filePath
        return -1
    endif
 
    LoadWave /J /D /O /E=1 /K=0 /L={0,firstDataLine,1000,2,5} /P=$pathName filePath
 
    return 0
End
The first problem is that this:
text = buffer[0,1]


needs to be changed to this:

text = buffer[0,2]


since you are comparing to " # " (your target string) which is three bytes.

The next problem is that your target string appears in this line (line 2021, zero-based), which is before your data:
 Core hole found (by occ.) in alpha space, orbital #   6


I changed your FindFirstDataLine function to add this:
String targetString = " #   1"
Variable targetStringLength = strlen(targetString)


Then I changed this:
return line+1                   // Success: The next line is the first data line.

to this:
// Print line   // For debugging only
return line                 // Success: This is the first data line.


Now it prints the correct line number: 2032 (zero-based)

The next problem is that your file is space-delimited and the LoadWave operation defaults to comma and tab as delimiters. I fixed this by adding a /V flag:
LoadWave /J /D /O /E=1 /K=0 /L={0,firstDataLine,1000,2,5} /V={" ", "", 0, 0} /P=$pathName filePath


With that, it seems to do the right thing. That is, it loads this:
1 287.0055 0.007305 0.014191 0.02894 ...

Your next task is to change FindFirstDataLine to FindFirstAndLastDataLines.

But first, there is another problem. Lines 999 and 1000 of the data look like this:
# 999 377.5523 0.000006 -0.000367 0.000177 -0.000682 0.0000 337.5637
#1000 377.7301 0.000005 -0.000327 0.000165 -0.000623 0.0000 151.3746

Because space is a delimiter, and there is no space after the # character in line 1000, that line appears to LoadWave to have one fewer column than line 999. This causes the wrong data to be loaded starting at line 1000. I will give some thought to how to fix this.
Your file is actually a FORTRAN-style fixed field file and so can be loaded using LoadWave/F instead of LoadWave/J.

I made this change. It also requires using the /B flag to specify the width in bytes of each column of data. /B also lets you name each column and skip whatever columns you don't want to load.
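To illustrate why fixed byte ranges sidestep the missing-space problem, here is a small sketch. The function name DemoFixedFields is made up for illustration, and the two strings are shortened copies of data lines 999 and 1000, assuming the leading space and column spacing of the original file.

```igor
Function DemoFixedFields()
    // Shortened copies of data lines 999 and 1000 (spacing assumed from the file)
    String line999  = " # 999   377.5523  0.000006"
    String line1000 = " #1000   377.7301  0.000005"

    // Bytes 0-1 hold " #", bytes 2-5 the row index, bytes 6-16 the energy.
    // The same byte ranges work for both lines, with or without a space after "#".
    Printf "index=[%s] energy=[%s]\r", line999[2,5], line999[6,16]
    Printf "index=[%s] energy=[%s]\r", line1000[2,5], line1000[6,16]
End
```

This is the same idea that the column widths given to /B express: each field is located by position rather than by a delimiter.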

I also morphed FindFirstDataLine into FindFirstLineAndNumLines.

The result successfully loads your example file:

#pragma TextEncoding = "UTF-8"
#pragma rtGlobals=3     // Use modern global access method and strict wave access.

Function FindFirstLineAndNumLines(pathName, filePath, firstDataLine, numDataLines)
    String pathName     // Name of symbolic path or ""
    String filePath         // Name of file or partial path relative to symbolic path
    Variable &firstDataLine // Pass-by-reference output
    Variable &numDataLines  // Pass-by-reference output

    firstDataLine = -1
    numDataLines = -1
 
    Variable refNum
 
    Open/R/P=$pathName refNum as filePath
 
    String buffer, text
    Variable line = 0
   
    String targetString = " #   1"
    Variable targetStringLength = strlen(targetString)
   
    // Find first line
    do
        FReadLine refNum, buffer
        if (strlen(buffer) == 0)
            Close refNum
            return -1                       // The expected keyword was not found in the file
        endif
        text = buffer[0,targetStringLength-1]
        if (CmpStr(text,targetString) == 0)
            firstDataLine = line
            break                           // This is the first data line
        endif
        line += 1
    while(1)
   
    // Find last line
    targetString = " #"
    targetStringLength = strlen(targetString)
    do
        FReadLine refNum, buffer
        if (strlen(buffer) == 0)
            // Ran out of lines: the previously read line was the last data line
            break
        endif
        text = buffer[0,targetStringLength-1]
        if (CmpStr(text,targetString) != 0) // Line does not start with "<space>#"?
            // This is the line after the last data line
            break  
        endif
        line += 1
    while(1)
   
    numDataLines = line - firstDataLine + 1
   
    // Print firstDataLine, numDataLines        // For debugging only

    Close refNum

    return 0        // Success
End
 
Function LoadDataFile(pathName, filePath, extension)
    String pathName     // Name of symbolic path or "" to display dialog.
    String filePath         // Name of file or "" to display dialog. Can also be full or partial path relative to symbolic path.
    String extension            // e.g., ".dat" for .dat files. "????" for all files.
 
    Variable refNum
 
    // Possibly display Open File dialog.
    if ((strlen(pathName)==0) || (strlen(filePath)==0))
        Open /D /R /P=$pathName /T=(extension) refNum as filePath
        filePath = S_fileName           // S_fileName is set by Open/D
        if (strlen(filePath) == 0)      // User cancelled?
            return -1
        endif
        // filePath is now a full path to the file.
    endif
 
    Variable firstDataLine, numLines
    Variable result = FindFirstLineAndNumLines(pathName, filePath, firstDataLine, numLines)
    if (result != 0)
        Printf "No data found in file %s\r", filePath
        return -1
    endif

    // Example Data:
    // #   1   287.0055  0.007305   0.014191   0.028940   0.000089      0.0000       72.7191
    // # 999   377.5523  0.000006  -0.000367   0.000177  -0.000682      0.0000      337.5637
    // #1000   377.7301  0.000005  -0.000327   0.000165  -0.000623      0.0000      151.3746
   
    String columnInfoStr = ""       // Prepare parameter for /B flag
    columnInfoStr += "N='_skip_',W=2;"
    columnInfoStr += "N='Column1',W=4;"
    columnInfoStr += "N='Column2',W=11;"
    columnInfoStr += "N='Column3',W=10;"
    columnInfoStr += "N='Column4',W=11;"
    columnInfoStr += "N='Column5',W=11;"
    columnInfoStr += "N='_skip_',W=11;"
    columnInfoStr += "N='_skip_',W=12;"
    columnInfoStr += "N='_skip_',W=14;"
   
    LoadWave /F={9, 11, 0} /B=columnInfoStr /D /O /E=1 /K=0 /L={0,firstDataLine,numLines,0,0} /P=$pathName filePath
 
    return 0
End

Function Test()
    LoadDataFile("home", "Sample.txt", ".txt")
End
That worked beautifully! Thank you very much hrodstein for the detailed explanation.

Best wishes,

Vic