Remove non-readable characters from a string

Hi,

I am loading the header of a TIFF image. The header loads successfully as a string, but it contains some non-readable characters.

Is it possible to remove these non-readable characters from the string?

Function/S GetHeader(path, endline)

    String path // full path to the file
    Variable endline  // number of header lines to read (at least one line is always read)

    String buffer
    Variable refNum

    Open/R refNum as path
    String header = ""
    Variable i = 0
    Do
        FReadLine refNum, buffer // read one line at a time
        if (strlen(buffer) == 0) // FReadLine returns an empty string at end of file
            break
        endif
        header += buffer
        i++
    while(i<endline)  // stop after endline lines to avoid reading the whole file

    Close refNum // release the file once the header has been read

    return header

End

 

header loaded as a wave note test.tif_.zip

I think you could use CleanupName to do that if the lines are shorter than 255 characters, e.g.:

    Do
        FReadLine refNum, buffer // read one line at a time
        header += CleanupName(buffer,1) // nix non-ASCII characters
        i++
    while(i<endline)  // stop after endline lines to avoid reading the whole file

Otherwise you might have to add a sub-loop for longer buffers.

First, you might consider using Igor's ImageLoad operation to load your TIFF file. It creates an S_info output variable. While that isn't currently documented (I have asked the author to document it), it contains the "# Pixel_size..." string that I think you are trying to get. ImageLoad has flags that allow you to read the tags from the image. You can also read just the metadata without reading in the actual image (/RTIO flag).

If you want to eliminate the non-printable characters (most of which are due to null bytes in the string), you could use this ConvertTextEncoding command:

String converted = ConvertTextEncoding(header, 1, 1, 3, 2)

That will treat the input string as UTF-8 (ASCII is valid UTF-8) and convert to UTF-8. Any invalid UTF-8 sequences will be silently dropped.

Thank you very much for the help. ImageLoad/RTIO works perfectly.

I also tried CleanupName and ConvertTextEncoding, but neither removed the non-readable characters.

 

I have done this before using brute force: use char2num on each character in the string and compare the result with the acceptable ranges.

The digits 0-9 are 48-57, A=65, Z=90, a=97, z=122, and then you need to add whatever other characters you will allow (=, ., ;, etc.). Then brute-force through the string, taking one character at a time from the input string and appending only the acceptable ones to the output string. I was looking at our Python implementation of a similar function and it does something similar. It will not be pretty or fast, but these headers cannot be too long anyway.
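
A minimal sketch of that brute-force filter, in case it helps. The function name KeepPrintableASCII is made up for illustration, and the choice of what to keep (printable ASCII codes 32-126 plus tab, line feed and carriage return) is my assumption; adjust the ranges to whatever characters your header actually needs.

```igor
// Sketch only: keeps printable ASCII (codes 32-126) plus tab, LF and CR.
// KeepPrintableASCII is a hypothetical name, not a built-in Igor function.
Function/S KeepPrintableASCII(str)
    String str

    String result = ""
    Variable numBytes = strlen(str)
    Variable i, code
    for(i = 0; i < numBytes; i += 1)
        code = char2num(str[i])  // numeric code of the byte at position i
        if ((code >= 32 && code <= 126) || code == 9 || code == 10 || code == 13)
            result += str[i]     // acceptable character: append to the output
        endif
    endfor

    return result
End
```

It could then be applied to the loaded header with something like String cleaned = KeepPrintableASCII(GetHeader(path, 100)). Null bytes and any bytes outside the accepted range are simply skipped, so non-ASCII text would be lost as well.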

An alternative is to switch your Pilatus to writing Nexus files, if your software supports it (I heard newer versions do). Igor can read Nexus (=HDF5) natively in IP9, and with an XOP in IP8 and earlier.

I have created a code snippet that addresses the original question: https://www.wavemetrics.com/code-snippet/remove-non-printable-character…

Note that when loading a TIFF image you will probably be better off using ImageLoad directly to load the image. If you need the information in the header, you can use the /RAT flag (or /LTMD flag if you use /BIGT=1). That will store information that is in the header in waves in a Tagn data folder. If you don't care about the image data and just want the metadata, you can use the /RTIO flag.