Simple CSV parser in C# (with comma-in-cell support!)

Some time ago I needed a simple solution in pure C# for reading and writing CSV files. Surprisingly, I couldn’t find any premade code in Google and had to write it on my own.

What is CSV and where’s the catch?

CSV stands for comma-separated values. These files are used to store tabular data. For example, if we have:

John Smith jsmith@example.com 250 15
Mary Bowden mbowden@example.com 291 7
Edward Miller emiller@example.com 174 10

…we can save it in CSV file containing:

John Smith,jsmith@example.com,250,15
Mary Bowden,mbowden@example.com,291,7
Edward Miller,emiller@example.com,174,10

Simple as that. Reading these files should be easy, right? One line equals one row and each row can be divided into cells using comma as a separator. Writing – analogously. There are even some code snippets in Google, so you don’t need to implement it by yourself. Job’s done, isn’t it?

Unfortunately not. There’s a catch: what if one of our cells contains a comma? Our simple algorithm would break then, recognizing it as two cells.

And that’s the point where things get tricky. CSV format is not standardized, so there is no standard way of handling these situations. You just cannot make it work in every environment. Fortunately, I had one requirement: my CSV files should be compatible with Excel. And saying “compatible” I mean double-click and there you go, no importing needed. This turned out later to cause some serious problems… But, at this point, I had no worries: my scripts should behave exactly as Excel in this situation.

How does Excel handle CSV files?

After a few iterations of saving CSV files in Excel and opening them in Notepad I observed two things:

  • Firstly, if a cell contains a comma, Excel adds quotation marks at the beginning and at the end of a cell.
  • Secondly, if this quoted cell contains quotations marks inside, Excel doubles them. So single quotation mark means “special” character (added by Excel). On the other hand, double quotation mark means “normal” character (added by the user).

To make things clear, let’s take a look at the example:

Normal line Some value
Line with comma Some, value
Line with quotes Some “value”
Line with comma and quotes Some, “value”
Line with double quotes Some “”value””

It’s going to be saved as:

Normal line,Some value
Line with comma,"Some, value"
Line with quotes,"Some ""value"""
Line with comma and quotes,"Some, ""value"""
Line with double quotes,"Some """"value"""""

As you see, if a cell contains comma or quotation mark, it gets surrounded by single quotes. Every single quote inside a cell is doubled (that’s why ""value"" changed into """"value""""). Quite easy to implement, so let’s write some code!

Config

First of all, we’re going to create simple Config class to avoid some hardcoding. It turned out to be particularly useful when we encountered internationalization problems. But – more on that later.

public class CsvConfig
{
    public char Delimiter { get; private set; }
    public string NewLineMark { get; private set; }
    public char QuotationMark { get; private set; }

    public CsvConfig(char delimiter, string newLineMark, char quotationMark)
    {
        Delimiter = delimiter;
        NewLineMark = newLineMark;
        QuotationMark = quotationMark;
    }

    // useful configs

    public static CsvConfig Default
    {
        get { return new CsvConfig(',', "\r\n", '\"'); }
    }

    // etc.
}

Writer class

Here’s CsvWriter class:

public class CsvWriter
{
    private CsvConfig m_config;
    private StringBuilder m_csvContents;

    public CsvWriter(CsvConfig config = null)
    {
        if (config == null)
            m_config = CsvConfig.Default;
        else
            m_config = config;

        m_csvContents = new StringBuilder();
    }

    public void AddRow(IEnumerable<string> cells)
    {
        int i = 0;
        foreach (string cell in cells)
        {
            m_csvContents.Append(ParseCell(cell));
            m_csvContents.Append(m_config.Delimiter);

            i++;
        }

        m_csvContents.Length--; // remove last delimiter
        m_csvContents.Append("\r\n");
    }

    private string ParseCell(string cell)
    {
        // cells cannot be multi-line
        cell = cell.Replace("\r", "");
        cell = cell.Replace("\n", "");

        if ( ! NeedsToBeEscaped(cell))
            return cell;

        // double every quotation mark
        cell = cell.Replace(m_config.QuotationMark.ToString(), string.Format("{0}{0}", m_config.QuotationMark));

        // add quotation marks at the beginning and at the end
        cell = m_config.QuotationMark + cell + m_config.QuotationMark;

        return cell;
    }

    private bool NeedsToBeEscaped(string cell)
    {
        if (cell.Contains(m_config.QuotationMark.ToString()))
            return true;

        if (cell.Contains(m_config.Delimiter.ToString()))
            return true;

        return false;
    }

    public string Write()
    {
        return m_csvContents.ToString();
    }
}

The code should be quite easy to understand. CsvWriter has only two public methods: AddRow() and Write(). The latter writes all rows added to CsvWriter into a string. The result is ready to be saved as a CSV file.

The magic happens earlier, in the AddRow() method. Every row is IEnumerable of strings, each one representing one cell. Every cell is parsed and then added to result with delimiter character (i.e. comma).

In our case cell cannot be multi-line, so parsing removes new line characters. It is important because they are used to distinguish different rows. Then CsvWriter checks if the cell needs to be escaped (i.e. contains commas or quotes). If yes, every quote is doubled and the whole cell is decorated with single quotes at the beginning and at the end.

Usage is also straight-forward:

var writer = new CsvWriter();
writer.AddRow(new string[] {"a", "b", "c"});
writer.AddRow(new List<string> {"1", "2", "3"});
string csv = writer.Write();
File.WriteAllText("file.csv", csv);

Reader class

Reader is a bit more sophisticated. Let’s take a look:

public class CsvReader
{
    private CsvConfig m_config;

    public CsvReader(CsvConfig config = null)
    {
        if (config == null)
            m_config = CsvConfig.Default;
        else
            m_config = config;
    }

    public IEnumerable<string[]> Read(string csvFileContents)
    {
        using (StringReader reader = new StringReader(csvFileContents))
        {
            while (true)
            {
                string line = reader.ReadLine();
                if (line == null)
                    break;
                yield return ParseLine(line);
            }
        }
    }

    private string[] ParseLine(string line)
    {
        Stack<string> result = new Stack<string>();

        int i = 0;
        while (true)
        {
            string cell = ParseNextCell(line, ref i);
            if (cell == null)
                break;
            result.Push(cell);
        }

        // remove last elements if they're empty
        while (string.IsNullOrEmpty(result.Peek()))
        {
            result.Pop();
        }

        var resultAsArray = result.ToArray();
        Array.Reverse(resultAsArray);
        return resultAsArray;
    }

    // returns iterator after delimiter or after end of string
    private string ParseNextCell(string line, ref int i)
    {
        if (i >= line.Length)
            return null;

        if (line[i] != m_config.QuotationMark)
            return ParseNotEscapedCell(line, ref i);
        else
            return ParseEscapedCell(line, ref i);
    }

    // returns iterator after delimiter or after end of string
    private string ParseNotEscapedCell(string line, ref int i)
    {
        StringBuilder sb = new StringBuilder();
        while (true)
        {
            if (i >= line.Length) // return iterator after end of string
                break;
            if (line[i] == m_config.Delimiter)
            {
                i++; // return iterator after delimiter
                break;
            }
            sb.Append(line[i]);
            i++;
        }
        return sb.ToString();
    }

    // returns iterator after delimiter or after end of string
    private string ParseEscapedCell(string line, ref int i)
    {
        i++; // omit first character (quotation mark)
        StringBuilder sb = new StringBuilder();
        while (true)
        {
            if (i >= line.Length)
                break;
            if (line[i] == m_config.QuotationMark)
            {
                i++; // we're more interested in the next character
                if (i >= line.Length)
                {
                    // quotation mark was closing cell;
                    // return iterator after end of string
                    break;
                }
                if (line[i] == m_config.Delimiter)
                {
                    // quotation mark was closing cell;
                    // return iterator after delimiter
                    i++;
                    break;
                }
                if (line[i] == m_config.QuotationMark)
                {
                    // it was doubled (escaped) quotation mark;
                    // do nothing -- we've already skipped first quotation mark
                }

            }
            sb.Append(line[i]);
            i++;
        }

        return sb.ToString();
    }
}

This time we have only one public method, Read(). It gets CSV file as a string, reads it line by line (i.e. row by row), and yields each row as an array of strings (i.e. cells).

Take a closer look at ParseLine():

private string[] ParseLine(string line)
{
    Stack<string> result = new Stack<string>();

    int i = 0;
    while (true)
    {
        string cell = ParseNextCell(line, ref i);
        if (cell == null)
            break;
        result.Push(cell);
    }

    // remove last elements if they're empty
    while (string.IsNullOrEmpty(result.Peek()))
    {
        result.Pop();
    }

    var resultAsArray = result.ToArray();
    Array.Reverse(resultAsArray);
    return resultAsArray;
}

ParseLine() uses ParseNextCell() to get subsequent cells and stores them in stack. Why did I choose this data structure? We sometimes had many empty cells at the end of every row in our CSV files. I wanted to have a nice and efficient way to get rid of them without removing empty cells in the middle of the row. Having the stack of cells, it’s enough to pop it as long as Peek() returns empty cell.

Then we convert it to the array – and because we had a stack, items will be stored in a reversed order. Thus one additional step (reversing array) is needed.

The most important part starts with ParseNextCell():

// returns iterator after delimiter or after end of string
private string ParseNextCell(string line, ref int i)
{
    if (i >= line.Length)
        return null;

    if (line[i] != m_config.QuotationMark)
        return ParseNotEscapedCell(line, ref i);
    else
        return ParseEscapedCell(line, ref i);
}

As you probably noticed, the variable i (let’s call it iterator) was created in ParseLine() and is passed further using ref keyword. This way we can iterate through the whole line and use different functions to parse subsequent parts. It is important to pay attention to value i after each method execution. In our case, each method returns the iterator pointing after delimiter (i.e. comma) or after the end of a string. Consequently, at the beginning of next method, iterator points to the beginning of a new part to parse (or is invalid).

That being said, firstly we need to check if the iterator is valid. If it is, we check character which it points to. If it’s a quotation mark, we encountered an escaped cell (only escaped cells start with quote). Otherwise, it’s a normal cell. In both cases, we call proper methods to parse it.

Parsing normal cell is pretty easy. The only thing we should care about is proper iterator value after method execution:

// returns iterator after delimiter or after end of string
private string ParseNotEscapedCell(string line, ref int i)
{
    StringBuilder sb = new StringBuilder();
    while (true)
    {
        if (i >= line.Length) // return iterator after end of string
            break;
        if (line[i] == m_config.Delimiter)
        {
            i++; // return iterator after delimiter
            break;
        }
        sb.Append(line[i]);
        i++;
    }
    return sb.ToString();
}

Parsing escaped cell is more complicated:

// returns iterator after delimiter or after end of string
private string ParseEscapedCell(string line, ref int i)
{
    i++; // omit first character (quotation mark)
    StringBuilder sb = new StringBuilder();
    while (true)
    {
        if (i >= line.Length)
            break;
        if (line[i] == m_config.QuotationMark)
        {
            i++; // we're more interested in the next character
            if (i >= line.Length)
            {
                // quotation mark was closing cell;
                // return iterator after end of string
                break;
            }
            if (line[i] == m_config.Delimiter)
            {
                // quotation mark was closing cell;
                // return iterator after delimiter
                i++;
                break;
            }
            if (line[i] == m_config.QuotationMark)
            {
                // it was doubled (escaped) quotation mark;
                // do nothing -- we've already skipped first quotation mark
            }

        }
        sb.Append(line[i]);
        i++;
    }

    return sb.ToString();
}

Every escaped cell starts with a quote, so first of all, we should omit this character.

Interesting things happen when we hit on a quotation mark again. We immediately jump to the next character (by incrementing iterator). Basically, two cases exist:

  • If it’s delimiter or end of a text, we just reached the end of a cell – we break the loop.
  • Nonetheless, if it’s quotation mark again, we have doubled quote. So user expects that there will be a single quote in a result cell. We don’t break the loop and save one quote instead.

And that’s all. We have fully functional CsvReader. Usage is simple:

var reader = new CsvReader();
string csv = File.ReadAllText("file.csv");
foreach (var row in reader.Read(csv))
{
    // so something with the whole row

    foreach (var cell in row)
    {
        // do something with the cell
    }
}

CSV internationalization problems

As I mentioned before, Excel compatibility requirement caused some serious problems. Our CSV files had to be opened with a double-click (without importing). And, as it turned out, Excel isn’t as good being international. For example:

  • Despite the fact that CSV stands for “comma-separated values”, Excel doesn’t always use a comma for separating values. It depends on your system’s “Regional and Language Options”. As we’re located in Poland, our Excels use… semicolon! And it’s not even related to the system language – I have Windows in English, but CSVs with semicolons.
  • Excel always saves CSV files with ANSI encoding, which means… you never know what. In Western Europe it would be Windows-1252, in Poland: Windows-1250.
  • Excel doesn’t even try to recognize file encoding, it always assumes ANSI. So if you open a UTF-8 file, all you get is mojibake.

I found no solution to solve these problems. We assumed that our CSV files will be opened in Poland most often. Therefore we used a semicolon as a separator and Windows-1250 as an encoding. As a result, opening them in other parts of the world would require import.

But it isn’t the end of bad news. We use CSV files with texts in different languages, so we cannot always use Windows-1250. It just doesn’t support the full set of other languages’ characters. As you see, we’re up the creek – we have to use Windows-1250, but we cannot.

Consequently, we end up with most of our files being encoded in Windows-1250 and some of them in different encodings. As I stated earlier, Excel doesn’t get by such situation, but I made our tools be able to do so. Every file has a special header containing information about used encoding, delimiter etc. (basically it’s our Config class, somehow serialized and saved in the first line). Thanks to that we’re able to parse CSVs without errors – at least in our scripts. In Excel, it isn’t always possible.

Summary

When I was getting down to this task, it looked like something quite easy. CSV format is very simple after all, so what could possibly go wrong? Well, everything.

First of all, Google had no solution for me. Only simple scripts I found didn’t support  comma-in-cell (which was unacceptable). Besides, there were some big libraries, which not only handle CSV files but also do many other things. And most of these things were completely useless for us.

So I have written my own reader and writer, addressing some of the problems, but not all. Full Excel compatibility with opening files by double-click was accomplished only partially. Nevertheless, I hope that provided code snippets and described experiences will be useful for you.

By the way, it is the very first article on our blog. How do you like it? Let us know in the comments!

Uncategorized

1 Comment Leave a comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: