HTML table to text and numeric waves

I would like to access some data that comes in the form of a table in a webpage, turning it into text and numeric waves in IGOR. Has anyone developed an approach for converting HTML text to Igor waves?

Currently I'm doing the following workflow which works for one specific webpage but is pretty kludgy and probably not very general.
  • Import the HTML code as a string S_ServerResponse using the command URLRequest (Igor 7 specific)

  • Massage the code using a long list of ReplaceString commands, i.e. S_ServerResponse=ReplaceString("<td>",S_ServerResponse,";") or S_ServerResponse=ReplaceString(" align=\"center\"",S_ServerResponse,"") etc. etc.

  • Make a 2-D text wave to hold the table items

  • Fill the 2-D wave using a nested "for" loop and a nested StringFromList command (e.g. StringFromList(j,StringFromList(i,S_ServerResponse,"|"),";"))


  • Now I have a 2-D text wave with the HTML table contents. Next I'd like to convert each column to the appropriate kind of wave, either text, numeric, or date/time. Currently I'm exporting the 2D table to a CSV file using Save, then re-importing it using LoadWave such that Igor automatically determines the column type. Is there a cleaner way to do that without the save/load?
    You might be able to use the Grep operation's ability to store the output in a text wave to your advantage here. You might need to first do some preprocessing, but you could then copy your string to the clipboard using PutScrapText and then use the Grep operation to do the final step of converting the text into a text wave using a regular expression. That's still pretty kludgy, but maybe kludgy in a faster way :)

    ajleenheer wrote:

    Now I have a 2-D text wave with the HTML table contents. Next I'd like to convert each column to the appropriate kind of wave, either text, numeric, or date/time. Currently I'm exporting the 2D table to a CSV file using Save, then re-importing it using LoadWave such that Igor automatically determines the column type. Is there a cleaner way to do that without the save/load?

    I can't think of any other automatic way to do this. I think your approach is actually pretty clever.

    Is there any chance you can get the data from the web server as JSON?
    Well, your comment about Grep sent me down the dark hole of regular expressions, but I didn't quite make it to the bottom. So far this is my working code, tested on only a couple of webpages. If anyone has suggestions about better ways to clean up a given cell using some regular expression and perhaps the SplitString command, it would be welcome! Currently I'm using "strsearch" to find a ">" character, but I could imagine situations where that might not work.

    function ParseHTMLTable(htmltext)
        //Example usage: ParseHTMLTable(FetchURL("http://myurlwithtable.com"))
        //This attempts to load the last table in a webpage into IGOR waves. Doesn't work with nested tables.
       
        string htmltext
        htmltext=ReplaceString("\n",htmltext,"")//delete any line breaks
       
        //Get the last table: create a list where each "<table" entry is separated by ^
        htmltext=ReplaceString("<table",htmltext,"^<table")
        htmltext=StringFromList(ItemsInList(htmltext,"^")-1,htmltext,"^") //Assuming the table of interest is the last table on the webpage
       
        //Determine if the table has headers: any <th> tags?
        variable headerexists=StringMatch(htmltext,"*<th*")
       
        //Create a list based on tr and td and th; rows are separated by ~ and columns are separated by `
        htmltext=ReplaceString("</tr>",htmltext,"~")
        htmltext=ReplaceString("</td>",htmltext,"`")
        htmltext=ReplaceString("</th>",htmltext,"`")

        variable nrows =ItemsInList(htmltext,"~")-1 //subtract 1 because there's always a dummy row containing </table>
        variable ncols=ItemsInList(StringFromList(0,htmltext,"~"),"`")
        Print "Importing HTML table with "+num2str(nrows)+" rows and "+num2str(ncols)+" columns"
       
        //Put the (messy with HTML code) items into a text wave
        Make/O/T/N=(nrows,ncols) htmlTable=StringFromList(q,StringFromList(p,htmltext,"~"),"`") //each table row goes into a row of textWave
       
        //Clean up each item of the text wave in turn, given that each entry is proceeded by >
        variable i,j
        variable arrowposition=0
        string cellstring=""
        for (i=0;i<nrows;i+=1)
            for(j=0;j<ncols;j+=1)
                cellstring=htmlTable[i][j]
                arrowposition=strsearch(cellstring,">",inf,1)
                htmlTable[i][j]=cellstring[arrowposition+1,strlen(cellstring)]
            endfor
        endfor
       
        //Create a string containing the wave names based on the headers if they exist
        string wavenames=""
        for(i=0;i<ncols;i+=1)
            if(headerexists)
                wavenames=wavenames+"N='"+CleanupName(htmltable[0][i],1)+"';"
            else
                wavenames=wavenames+"N='htmlColumn"+num2str(i+1)+"';"
            endif
        endfor
        if(headerexists)
            Print "Getting wave names from table header."
            DeletePoints/M=0 0,1,htmlTable //remove the header row now that we know the names
        endif
       
        Edit htmltable //display the text table
       
        //Rather than messing about with column types, let's just save & reload to let Igor worry about column types
        Save/J/O/P=IgorUserFiles htmlTable as "HTMLtable.txt" //delimiter is tab; assumes table has no tabs embedded
        LoadWave/E=1/Q/O/A/J/V={"\t"," $",0,0}/D/B=waveNames/K=0/P=IgorUserFiles "HTMLtable.txt" //set E=0 if the loaded waves should not be displayed in a table
    end
    I have to ask ... Wasn't HTML translation at one time supposedly made easy to handle by XSLT? I dimly remember an XSLT input routine that I wrote for Igor Pro nearly a few decades ago. Aren't such translations now routinely handled through JSON or JavaScript or Python or ...??? I ask the latter question because, as I recall that I did with the XSLT routine, it may (have been) / (still be) easier to download a robust conversion script (HTML Table -> matrix / array data) and execute it from Igor Pro through a shell / batch script.

    --
    J. J. Weimer
    Chemistry / Chemical & Materials Engineering, UAH
    ajleenheer wrote:
    Well, your comment about Grep sent me down the dark hole of regular expressions, but I didn't quite make it to the bottom. So far this is my working code, tested on only a couple of webpages. If anyone has suggestions about better ways to clean up a given cell using some regular expression and perhaps the SplitString command, it would be welcome! Currently I'm using "strsearch" to find a ">" character, but I could imagine situations where that might not work.

    You might get better responses if you include a link to the source HTML that you're processing, and if you clarified exactly what it is that you want the output to look like. Reading only your code, it's hard to get an idea of what the source data looks like.
    I'd be open to any robust shell scripts that convert HTML tables to JSON, XML etc. if you know of any. I didn't find much in my web searching, though the "tidy" utility appeared somewhat useful. However, I was hoping for a self-contained IGOR procedure so that this code could be portable to different machines and operating systems without auxiliary files.

    aclight wrote:
    include a link to the source HTML that you're processing, and if you clarified exactly what it is that you want the output to look like.


    I've attached an example HTML file similar to what I'm processing that contains some nonsense data. For output, I want each column loaded into Igor as the appropriate kind of 1D wave. My code works relatively well on this example, other than the column with values separated by commas. I've only tested it on a couple other webpages. To load the example file, run from the command line:

    variable refNumHTML
    string buffer
    Open/R refNumHTML as ""
    FReadLine/T="" refNumHTML, buffer
    Close/A
    ParseHTMLTable(buffer)
    ExampleHTML.txt
    It looks like your code is handling the column with the values like "2,5,2,12" correctly as your htmlTable wave looks right. I suspect the problem is that when you save to a file and then load using LoadWave, Igor is truncating the value in that "column" at the comma since that's not part of a number. You might need to specify the type of that particular column in the string you use with the /B flag of LoadWave.
    If you are open to non-Igor solutions for this, you can use nokogiri to parse xml/html. The output can then be loaded into Igor and processed however you like.