Find longest common substring in a list?

Greetings all,

I'm wondering if there is a known method (built-in or code snippets) to solve the following problem.

I have several lists of strings whose entries are generally of the form "SampleID_ReplicateID;".

SampleID and ReplicateID are manually entered by the operators when collecting data and the combined entry is automatically applied to the output. SampleID is entered once for a sample, and ReplicateID is entered for every replicate run. Each list represents one SampleID.

Unfortunately I cannot enforce a format for the ReplicateID. It is freeform text. It could be 1,2,3,... or A,B,C,... or even Jane,Paul,Mary,...

I want to extract the SampleID by itself from each List.

SampleID and ReplicateID may contain "_" so it is not as simple as searching for that character.

But since SampleID will be the same in each list, I basically want to search for the longest substring that is common amongst each list item. (If there are common leading characters in the ReplicateID I may try to deal with them, but I don't think that will actually affect my subsequent processing.)

Any suggestions?

 

--Andreas

Edit: I had posted an idea but then I re-read your post and realized I was not understanding the problem. If you could post a couple of example lists, that would help me understand the problem better.

In reply to by gsb

Here is an example list

P20_C1_F1_R3__1903_036;P20_C1_F1_R3__1903_037;P20_C1_F1_R3__1903_038;P20_C1_F1_R3__1903_039;P20_C1_F1_R3__1903_040;

 

In this case, the SampleID = P20_C1_F1_R3__1903

The longest common substring would actually be P20_C1_F1_R3__1903_0 (due to the leading zero on the ReplicateID), but I can probably handle that downstream.

Another example is

P7_PRE_B;P7_1_A;P7_1_B;P7_2_A;P7_2_B;

Here SampleID = P7, which is also the longest substring after removal of the terminal "_".


 

In reply to by topchem

If sampleID is always the start of each list item, you could take a simple brute-force approach and figure out how many characters it is until the strings don't match. E.g., the following seems to work if you run testit()

function/S longestCommonStartStr(list,[listWithoutStartStr])
    String list
    String &listWithoutStartStr //optional -- will hold list without startStr
   
    WAVE/T txt = listtotextwave(list,";")  
    String testChar=(txt[0])[0],commonStartStr=""
    Variable i,items=dimsize(txt,0),count,doBreak=0
   
    for (count=0;strlen(testChar)>0;)
       
        for (i=1;i<items;i+=1)
            if (!stringmatch( (txt[i])[count],testChar))
                doBreak=1
                break
            endif
        endfor 
       
        if (doBreak)
            break
        endif
       
        commonStartStr+=testChar
        count+=1
        testChar = (txt[0])[count]
    endfor
   
    if ( !paramIsDefault(listWithoutStartStr) )
        txt = ReplaceString(commonStartStr,txt[p],"",0,1)
        wfprintf listWithoutStartStr,"%s;",txt
    endif
   
    return commonStartStr
   
end

function testit()
    String blah
    print longestCommonStartStr("aaadaab;aaaaa;aaaaac;",listWithoutStartStr=blah)
    print blah
end

 

If you have the name as a string wave you could sort the wave alphabetically with Sort and then find the longest common string of the first and last strings in the wave. That way you only have to compare two strings and not all of them.

In reply to by olelytken

olelytken wrote:

If you have the name as a string wave you could sort the wave alphabetically with Sort and then find the longest common string of the first and last strings in the wave. That way you only have to compare two strings and not all of them.

With Igor 7 or later you can use the ListToTextWave function to convert a string list into a text wave. However for the purposes of this question you could also just use SortList (probably on a copy of the string list) without first converting the string list into a text wave.

This incorporates the suggestion from Olelytken and Adam. It avoids an explicit for loop.

Function/S get_longestcommonprefix(strlist,[oend])
    string strlist,oend
   
    if (ParamIsDefault(oend))
        oend = "_"
    endif
   
    string rtstr = "",fstr,lstr
    variable npts
   
    npts = ItemsInList(strlist)
    strlist = SortList(strlist)
    fstr = StringFromList(0,strlist)
    lstr = StringFromList(npts-1,strlist)
   
    npts = min(strlen(fstr),strlen(lstr))
    make/N=(npts)/FREE rwstr   
    rwstr = (cmpstr(fstr[p],lstr[p])==0) ? 1 : 0

    FindValue/V=0 rwstr
    npts = V_value - 1
   
    rtstr = RemoveEnding(fstr[0,npts],oend)
   
    return rtstr
   
end

I had initially thought to build a version that built a string (wave) as rwstr = ... ? faster[p] : "". As noted in a separate thread, the logical comparison would use SelectString(cmpstr(...),fstr[p],""). I don't know of any advantages unless the string can be built directly rather than using a text wave that must then be converted back to a string.

Using p instead of a for loop is usually faster. You could probably also multithread the p line, although I suspect the slowest bit in the code is the SortList line.

MultiThread rwstr = (cmpstr(fstr[p],lstr[p])==0) ? 1 : 0

 

By habit, I try to avoid explicit for loops when I can construct implicit ones. They should generally be faster and they are easier for me to debug visually.

When the strings in the list all have at least one character past a common prefix, you can eliminate the list sorting and choose any two of the items in the list (e.g. the first and last).

With only 20 list items I think the solution from jjweimer is fine. With more than a couple of thousands switchting to using textwaves instead could be faster.