UPDATE: The code snippet I originally posted used concatenation to build a running “total” of all data read so far. Pre-allocation is far better, especially for large data files. I had left the pre-allocation code out because it obscures the main idea of the approach, but one reader pointed out that the original code performs rather badly in practice, so the code in this post now includes it.
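As a rough illustration of the difference, here is a minimal sketch of the two strategies on a toy loop. The row count and the size estimate are made-up values, not numbers from a real data file.

```matlab
n = 1000;                      % hypothetical number of rows actually read

% Growing by concatenation: the array has to be resized on every append.
grown = [];
for k = 1:n
    grown = [grown; k];        %#ok<AGROW>
end

% Pre-allocating: reserve space once, fill by index, then trim the excess.
estimate = 1500;               % deliberately generous guess at the row count
filled = nan(estimate, 1);
for k = 1:n
    filled(k) = k;
end
filled = filled(1:n);          % drop the unused pre-allocated rows
```

Both loops produce the same column vector; the second avoids repeatedly resizing the array, which is where the concatenating version tends to lose time on large inputs.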

Matlab has a very nice data import wizard that uses the function importdata. One immediate drawback is that the import stops, without raising an error, as soon as it encounters non-numeric data. As your dataset gets larger, it becomes increasingly difficult to tell whether all of your data was read correctly without a lot of manual checking. My usual workaround is to read all data as strings with textscan and then convert it with str2double, which automatically turns bad data into NaN values. However, this approach uses an enormous amount of memory for large datasets. This article describes an alternative method.
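For reference, the string-based workaround might look roughly like the sketch below. The sample file, its contents, and the two-column comma-separated layout are assumptions made purely for illustration.

```matlab
% Write a small sample file (hypothetical contents) with one bad field.
sampleFile = [tempname '.csv'];
fid = fopen(sampleFile, 'w');
fprintf(fid, 'a,b\n1,2\n3,oops\n5,6\n');
fclose(fid);

% Read every column as text so that no field can stop the scan early.
fid = fopen(sampleFile);
raw = textscan(fid, '%s %s', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid);

% str2double returns NaN for any string it cannot convert to a number.
values = cellfun(@str2double, raw, 'UniformOutput', false);
data = [values{:}];            % one numeric column per input column

delete(sampleFile);
```

Every field is held in memory as a string before it is converted, which is why this route gets expensive on large files.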

Since textscan reaches the end of the file only if it reads everything successfully, we merely need to check the end-of-file flag with feof; if it is not set, there was an error, so we discard the offending line of data and continue importing from the next line. We repeat the textscan call (and the subsequent check) until we reach the end of the file. The following code snippet shows this method in action:

function [ total ] = robustTextScan( fileName, format, headerRows, lineEstimate, lineEstimateType, varargin )
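%ROBUSTTEXTSCAN Read a text file with textscan, skipping lines that fail to parse.
%   fileName, format, headerRows: file to read, textscan format, header lines to skip.
%   lineEstimate: estimated number of data lines, used for pre-allocation ([] to skip).
%   lineEstimateType: cell array with one fill value per column (e.g. NaN for numeric).
%   Any further arguments are passed through to textscan.

fid = fopen(fileName);
if fid == -1
    error('robustTextScan:fileOpen', 'Could not open file: %s', fileName);
end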

temp = textscan(fid, format, 'HeaderLines', headerRows, varargin{:});
total = cell( size(temp) );

% pre-allocate as well as you can
if ~isempty(lineEstimate) && iscell(lineEstimateType) && length(lineEstimateType) == length(total)
    for i = 1:length(total)
        total{i} = repmat(lineEstimateType{i}, lineEstimate, 1);
    end
end

% if we haven't hit the EOF, it's because there was an error in the current
% line. otherwise, we're done
if ~feof(fid)
    
    lowerIndex = 1;
    
    while ~feof(fid)
        % find the shortest list; this is the one that had an error
        minLength = min(cellfun( @(x) size(x,1), temp));
        
        if minLength > 0
            
            % chop off everything up to the minimum length
            temp = cellfun( @(x) x(1:minLength,:), temp, 'UniformOutput', 0);
            
            % append the chopped off bit to the total output
            upperIndex = lowerIndex + minLength - 1;
            
            for i = 1:length(temp)
                total{i}(lowerIndex:upperIndex) = temp{i};
            end
            
            lowerIndex = upperIndex + 1;
        end
        
        % try again after the next line
        fgets(fid);
        
        % continue scanning and appending to the total
        temp = textscan(fid, format, varargin{:});
        
        minLength = min(cellfun( @(x) size(x,1), temp));
    end
    
    % the last chunk ended at the end of the file; work out where it goes
    upperIndex = lowerIndex + minLength - 1;
    
    % store the last chunk, then trim the unused pre-allocated rows
    for i = 1:length(temp)
        total{i}(lowerIndex:upperIndex) = temp{i}(1:minLength,:);
        total{i} = total{i}(1:upperIndex);
    end
else
    total = temp;
end

fclose(fid);

end
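
A short usage sketch follows, assuming the function above is saved as robustTextScan.m somewhere on the Matlab path. The sample file, its contents, and the line estimate of 10 are made up for illustration.

```matlab
% Write a small sample file (hypothetical contents): one header row and
% one data row containing a non-numeric field.
sampleFile = [tempname '.txt'];
fid = fopen(sampleFile, 'w');
fprintf(fid, 'x y\n1 2\n3 oops\n5 6\n');
fclose(fid);

% Two numeric columns, one header row, room pre-allocated for 10 rows,
% with NaN as the fill value for both columns.
total = robustTextScan(sampleFile, '%f %f', 1, 10, {NaN, NaN});

x = total{1};
y = total{2};

delete(sampleFile);
```

With the function as written, the row containing the bad field should be dropped entirely, so x and y come back with the same length and contain only the rows that parsed cleanly.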
One Response to Dealing with occasionally non-numeric data in Matlab

  1. Anon says:

    Neat trick! Helpful as usual, thanks
