Counting the number of atoms for an element in a chemical formula

Hello, 

I have a 2D wave wherein column 1 is a list of chemical formulas and column 2 is their intensity in a soft ionization mass spectra. Now I want to calculate elemental ratios for each of these formulas (e.g. O:C, N:C, H:C and S:C ratios). For this, I am trying to figure out a way for the program to read numbers between text. 

For example: C6H10O5. I want the program to be able to figure out how many C, H and O atoms are in this formula. Kindly advise how to optimally do this. I was trying to use regex expressions but couldn't iron out the logic. 

The issue is that in some formulas there will be double digits for an element while a single digit in others (e.g. C6 vs C16). So, the program needs to be able to recognize when it encounters a text as opposed to a number while counting for an element. 

How could I best execute this task? 

Thanks a ton! 

Best regards, 

Peeyush

For your simple case, I constructed this and it seems to work, i.e., the function extracts the number behind the species letter:

Function countAtoms(String formula, String species)
    if (strsearch(formula, species,0,2) == -1)
        return 0
    endif
    String count
    SplitString/E=(".*"+species+"([[:digit:]]+).*") formula, count
    return str2num(count)
End

For example:

print countAtoms("C6H10O5","H")
  10

In reply to by chozo

You might need to deal with the case of no digit(s) after the element.

I was working on the following when the replies were posted:

Function getElmntCount(sElmnt, sChem)
    string sElmnt
    string sChem
    string sRegEx = "([" + sElmnt + "]{1})([[:digit:]]+){1}"
    variable vCount = 0
    string sDummy, sCount
    if( GrepString( sChem, sRegEx )  )
        SplitString /E=sRegEx sChem, sDummy, sCount
        vCount = str2num( sCount )
    else
        sRegEx = "([" + sElmnt + "]{1})"
        if ( GrepString(sChem, sRegEx) == 1 )
            vCount = 1
        endif
    endif
    return vCount
End

Function getCHONC(sChem)
    string sChem
    variable vC, vH, vO, vN, vS
     vC = getElmntCount("C", sChem)
     vH = getElmntCount("H", sChem)
     vO = getElmntCount("O", sChem)
     vN = getElmntCount("N", sChem)
     vS = getElmntCount("S", sChem)
    printf "C: %g, H: %g, O: %g, N: %g, S: %g\r", vC, vH, vO, vN, vS
End

 

Thank you so much, tony, chozo and KurtB for these awesome suggestions! I am going to try them out and let you know in case I'm still stuck. Really appreciate the help! 

I would probably search through the string one character at a time.

1. The first character should be upper case
2. Check if the next character is lower case, numeric or upper case
3. Save lower case, if present
4. Save numeric, if present
5. Once you reach a new upper case character you start from 1

At the end you probably want to check the letter combinations against a list of valid elements

function/S elementlist(string formula)
    string strList = ""
    string strCount, element
    variable vCount, vAtoms
   
    if (GrepString(formula, "^[A-Za-z0-9\.]+$") == 0)
        return ""
    endif
   
    do
        SplitString/E=("(^[A-Z][a-z]*)([[:digit:],\.]*)(.*)") formula, element, strCount, formula
        if (strlen(element))
            vCount = NumberByKey(element, strList)
            vCount = numtype(vCount) ? 0 : vCount
            vAtoms = str2num(strCount)
            vAtoms = numtype(vAtoms) ? 1 : vAtoms
            strList = ReplaceNumberByKey(element, strList, vCount + vAtoms)
        else
            return ""
        endif
    while (strlen(formula))
    return strList
end

 

Thanks all again for your excellent solutions! 

@tony: thanks! Upon running elementlist ("C6H10O5"), I get C:6; H:10;O:5; as a text solution which is great. However, I want to store the counts in specific numeric waves so I could calculate the O:C, N:C, S:C ratios etc. 

One quick issue is that the output from elementlist will have or miss elements depending on which ones are present in the chemical formula. So how to assign element counts to individual numeric waves for each element? 

For example, I have a 1D wave with chemical formulas. Now I want 5 other 1D waves (or lets say 1 2D wave with 5 columns) where each column stores numeric count values of C, H, O, N, S. When an element is not present in the formula, it gets assigned a value of NaN in the 2D wave. I think if I start with a fully NaNed 2D wave, that will take care of the assigning NaNs to absent elements. However, I still need to assign the result from the text output as numeric values in individual columns of the 2D wave. 

Actually I haven't used "Numberbykey" much before so this is quite a learning experience for me. Not sure how to extract numbers from your code and assign them to 1D waves.. 

Kindly advise (all). Thanks a ton! 

If you want to use the keylist approach you can extract values from the list using the NumberByKey function. So you could put the keylists in a textwave and then write something like Wave2D[][%C] = NumberByKey("C", ListWave). But, if you're working with waves, you may be better off with chozo's approach: Wave2D[][%C] = countAtoms(FormulaWave, "C").

If the columns have elements as dimension labels you can also write

Wave2D = countAtoms(FormulaWave[p], GetDimLabel(Wave2D, 1, q))

Edit: I mixed up the names of the functions by chozo and KurtB

If you will choose my code, rather use the revised function below, which can also count single atoms:

Function countAtoms(String formula, String species)
    if (strsearch(formula, species,0,2) == -1)
        return 0
    endif
    String countStr
    SplitString/E=(".*"+species+"([[:digit:]]+).*") formula, countStr
    Variable count = str2num(countStr)
    return numtype(count) == 0 ? count : 1
End
print countAtoms("H2O","O")
  1

 

And, for completeness, there's this horribly inefficient version:

Wave2D = NumberByKey(GetDimLabel(Wave2D, 1, q), elementlist(FormulaWave[p]))

Thanks a lot, tony and chozo, for your responses. Below is what I have set up based on your codes, which does the job. I am going to read up more on the NumberbyKey function: 

Function elements(specieswave)

 
wave/t specieswave
variable i,j,num = dimsize(specieswave,0)

 
dfref dfr=root:Parameters:
svar/sdfr=dfr elements

 
variable ecount = itemsinlist(elements)

 
make/d/o/n=(num,ecount)elementscount = nan

 
make/d/o/n=(num,ecount-1)elementsratios = nan

 
for(i=0;i<ecount;i+=1)
SetDimLabel 1,i,$stringfromlist(i,elements),elementscount
endfor 

 
for(i=1;i<ecount;i+=1)
SetDimLabel 1,i-1,$stringfromlist(i,elements)+"/C",elementsratios
endfor 

 

 
for(j=0;j<ecount;j+=1)
for(i=0;i<num;i+=1)
elementscount[i][%$stringfromlist(j,elements)] = countAtoms(specieswave[i],stringfromlist(j,elements))
endfor
endfor

 
for(j=1;j<ecount;j+=1)
elementsratios[][%$stringfromlist(j,elements)+"/C"] = elementscount[p][%$stringfromlist(j,elements)]/elementscount[p][0]
endfor

 
elementsratios = numtype(elementsratios) == 1 ? nan : elementsratios

 

 
edit elementsratios
edit elementscount
end 

 

Note that you can use an implicit loop rather than nested loops to set values in a wave:

// use this

elementscount = countAtoms(specieswave[p],stringfromlist(q,elements))

// instead of this

for(j=0;j<ecount;j+=1)

   for(i=0;i<num;i+=1)

      elementscount[i][%$stringfromlist(j,elements)] = countAtoms(specieswave[i],stringfromlist(j,elements))

   endfor

endfor

 

Thanks, tony! I'd certainly prefer to cut out multiple loops wherever possible. This helps a lot.