Simple CSV parser in C# (with comma-in-cell support!)

Posted November 18, 2016 8 Comments by Piotr Korgul

Some time ago I needed a simple solution in pure C# for reading and writing CSV files. Surprisingly, I couldn’t find any premade code in Google and had to write it on my own.

What is CSV and where’s the catch?

CSV stands for comma-separated values. These files are used to store tabular data. For example, if we have:

John Smith	jsmith@example.com	250	15
Mary Bowden	mbowden@example.com	291	7
Edward Miller	emiller@example.com	174	10

…we can save it in CSV file containing:

John Smith,jsmith@example.com,250,15
Mary Bowden,mbowden@example.com,291,7
Edward Miller,emiller@example.com,174,10

Simple as that. Reading these files should be easy, right? One line equals one row and each row can be divided into cells using comma as a separator. Writing – analogously. There are even some code snippets in Google, so you don’t need to implement it by yourself. Job’s done, isn’t it?

Unfortunately not. There’s a catch: what if one of our cells contains a comma? Our simple algorithm would break then, recognizing it as two cells.

And that’s the point where things get tricky. CSV format is not standardized, so there is no standard way of handling these situations. You just cannot make it work in every environment. Fortunately, I had one requirement: my CSV files should be compatible with Excel. And saying “compatible” I mean double-click and there you go, no importing needed. This turned out later to cause some serious problems… But, at this point, I had no worries: my scripts should behave exactly as Excel in this situation.

How does Excel handle CSV files?

After a few iterations of saving CSV files in Excel and opening them in Notepad I observed two things:

Firstly, if a cell contains a comma, Excel adds quotation marks at the beginning and at the end of a cell.
Secondly, if this quoted cell contains quotations marks inside, Excel doubles them. So single quotation mark means “special” character (added by Excel). On the other hand, double quotation mark means “normal” character (added by the user).

To make things clear, let’s take a look at the example:

Normal line	Some value
Line with comma	Some, value
Line with quotes	Some “value”
Line with comma and quotes	Some, “value”
Line with double quotes	Some “”value””

It’s going to be saved as:

Normal line,Some value
Line with comma,"Some, value"
Line with quotes,"Some ""value"""
Line with comma and quotes,"Some, ""value"""
Line with double quotes,"Some """"value"""""

As you see, if a cell contains comma or quotation mark, it gets surrounded by single quotes. Every single quote inside a cell is doubled (that’s why ""value"" changed into """"value""""). Quite easy to implement, so let’s write some code!

Config

First of all, we’re going to create simple Config class to avoid some hardcoding. It turned out to be particularly useful when we encountered internationalization problems. But – more on that later.

public class CsvConfig
{
    public char Delimiter { get; private set; }
    public string NewLineMark { get; private set; }
    public char QuotationMark { get; private set; }

    public CsvConfig(char delimiter, string newLineMark, char quotationMark)
    {
        Delimiter = delimiter;
        NewLineMark = newLineMark;
        QuotationMark = quotationMark;
    }

    // useful configs

    public static CsvConfig Default
    {
        get { return new CsvConfig(',', "\r\n", '\"'); }
    }

    // etc.
}

Writer class

Here’s CsvWriter class:

public class CsvWriter
{
    private CsvConfig m_config;
    private StringBuilder m_csvContents;

    public CsvWriter(CsvConfig config = null)
    {
        if (config == null)
            m_config = CsvConfig.Default;
        else
            m_config = config;

        m_csvContents = new StringBuilder();
    }

    public void AddRow(IEnumerable<string> cells)
    {
        int i = 0;
        foreach (string cell in cells)
        {
            m_csvContents.Append(ParseCell(cell));
            m_csvContents.Append(m_config.Delimiter);

            i++;
        }

        m_csvContents.Length--; // remove last delimiter
        m_csvContents.Append("\r\n");
    }

    private string ParseCell(string cell)
    {
        // cells cannot be multi-line
        cell = cell.Replace("\r", "");
        cell = cell.Replace("\n", "");

        if ( ! NeedsToBeEscaped(cell))
            return cell;

        // double every quotation mark
        cell = cell.Replace(m_config.QuotationMark.ToString(), string.Format("{0}{0}", m_config.QuotationMark));

        // add quotation marks at the beginning and at the end
        cell = m_config.QuotationMark + cell + m_config.QuotationMark;

        return cell;
    }

    private bool NeedsToBeEscaped(string cell)
    {
        if (cell.Contains(m_config.QuotationMark.ToString()))
            return true;

        if (cell.Contains(m_config.Delimiter.ToString()))
            return true;

        return false;
    }

    public string Write()
    {
        return m_csvContents.ToString();
    }
}

The code should be quite easy to understand. CsvWriter has only two public methods: AddRow() and Write(). The latter writes all rows added to CsvWriter into a string. The result is ready to be saved as a CSV file.

The magic happens earlier, in the AddRow() method. Every row is IEnumerable of strings, each one representing one cell. Every cell is parsed and then added to result with delimiter character (i.e. comma).

In our case cell cannot be multi-line, so parsing removes new line characters. It is important because they are used to distinguish different rows. Then CsvWriter checks if the cell needs to be escaped (i.e. contains commas or quotes). If yes, every quote is doubled and the whole cell is decorated with single quotes at the beginning and at the end.

Usage is also straight-forward:

var writer = new CsvWriter();
writer.AddRow(new string[] {"a", "b", "c"});
writer.AddRow(new List<string> {"1", "2", "3"});
string csv = writer.Write();
File.WriteAllText("file.csv", csv);

Reader class

Reader is a bit more sophisticated. Let’s take a look:

public class CsvReader
{
    private CsvConfig m_config;

    public CsvReader(CsvConfig config = null)
    {
        if (config == null)
            m_config = CsvConfig.Default;
        else
            m_config = config;
    }

    public IEnumerable<string[]> Read(string csvFileContents)
    {
        using (StringReader reader = new StringReader(csvFileContents))
        {
            while (true)
            {
                string line = reader.ReadLine();
                if (line == null)
                    break;
                yield return ParseLine(line);
            }
        }
    }

    private string[] ParseLine(string line)
    {
        Stack<string> result = new Stack<string>();

        int i = 0;
        while (true)
        {
            string cell = ParseNextCell(line, ref i);
            if (cell == null)
                break;
            result.Push(cell);
        }

        // remove last elements if they're empty
        while (string.IsNullOrEmpty(result.Peek()))
        {
            result.Pop();
        }

        var resultAsArray = result.ToArray();
        Array.Reverse(resultAsArray);
        return resultAsArray;
    }

    // returns iterator after delimiter or after end of string
    private string ParseNextCell(string line, ref int i)
    {
        if (i >= line.Length)
            return null;

        if (line[i] != m_config.QuotationMark)
            return ParseNotEscapedCell(line, ref i);
        else
            return ParseEscapedCell(line, ref i);
    }

    // returns iterator after delimiter or after end of string
    private string ParseNotEscapedCell(string line, ref int i)
    {
        StringBuilder sb = new StringBuilder();
        while (true)
        {
            if (i >= line.Length) // return iterator after end of string
                break;
            if (line[i] == m_config.Delimiter)
            {
                i++; // return iterator after delimiter
                break;
            }
            sb.Append(line[i]);
            i++;
        }
        return sb.ToString();
    }

    // returns iterator after delimiter or after end of string
    private string ParseEscapedCell(string line, ref int i)
    {
        i++; // omit first character (quotation mark)
        StringBuilder sb = new StringBuilder();
        while (true)
        {
            if (i >= line.Length)
                break;
            if (line[i] == m_config.QuotationMark)
            {
                i++; // we're more interested in the next character
                if (i >= line.Length)
                {
                    // quotation mark was closing cell;
                    // return iterator after end of string
                    break;
                }
                if (line[i] == m_config.Delimiter)
                {
                    // quotation mark was closing cell;
                    // return iterator after delimiter
                    i++;
                    break;
                }
                if (line[i] == m_config.QuotationMark)
                {
                    // it was doubled (escaped) quotation mark;
                    // do nothing -- we've already skipped first quotation mark
                }

            }
            sb.Append(line[i]);
            i++;
        }

        return sb.ToString();
    }
}

This time we have only one public method, Read(). It gets CSV file as a string, reads it line by line (i.e. row by row), and yields each row as an array of strings (i.e. cells).

Take a closer look at ParseLine():

private string[] ParseLine(string line)
{
    Stack<string> result = new Stack<string>();

    int i = 0;
    while (true)
    {
        string cell = ParseNextCell(line, ref i);
        if (cell == null)
            break;
        result.Push(cell);
    }

    // remove last elements if they're empty
    while (string.IsNullOrEmpty(result.Peek()))
    {
        result.Pop();
    }

    var resultAsArray = result.ToArray();
    Array.Reverse(resultAsArray);
    return resultAsArray;
}

ParseLine() uses ParseNextCell() to get subsequent cells and stores them in stack. Why did I choose this data structure? We sometimes had many empty cells at the end of every row in our CSV files. I wanted to have a nice and efficient way to get rid of them without removing empty cells in the middle of the row. Having the stack of cells, it’s enough to pop it as long as Peek() returns empty cell.

Then we convert it to the array – and because we had a stack, items will be stored in a reversed order. Thus one additional step (reversing array) is needed.

The most important part starts with ParseNextCell():

// returns iterator after delimiter or after end of string
private string ParseNextCell(string line, ref int i)
{
    if (i >= line.Length)
        return null;

    if (line[i] != m_config.QuotationMark)
        return ParseNotEscapedCell(line, ref i);
    else
        return ParseEscapedCell(line, ref i);
}

As you probably noticed, the variable i (let’s call it iterator) was created in ParseLine() and is passed further using ref keyword. This way we can iterate through the whole line and use different functions to parse subsequent parts. It is important to pay attention to value i after each method execution. In our case, each method returns the iterator pointing after delimiter (i.e. comma) or after the end of a string. Consequently, at the beginning of next method, iterator points to the beginning of a new part to parse (or is invalid).

That being said, firstly we need to check if the iterator is valid. If it is, we check character which it points to. If it’s a quotation mark, we encountered an escaped cell (only escaped cells start with quote). Otherwise, it’s a normal cell. In both cases, we call proper methods to parse it.

Parsing normal cell is pretty easy. The only thing we should care about is proper iterator value after method execution:

// returns iterator after delimiter or after end of string
private string ParseNotEscapedCell(string line, ref int i)
{
    StringBuilder sb = new StringBuilder();
    while (true)
    {
        if (i >= line.Length) // return iterator after end of string
            break;
        if (line[i] == m_config.Delimiter)
        {
            i++; // return iterator after delimiter
            break;
        }
        sb.Append(line[i]);
        i++;
    }
    return sb.ToString();
}

Parsing escaped cell is more complicated:

// returns iterator after delimiter or after end of string
private string ParseEscapedCell(string line, ref int i)
{
    i++; // omit first character (quotation mark)
    StringBuilder sb = new StringBuilder();
    while (true)
    {
        if (i >= line.Length)
            break;
        if (line[i] == m_config.QuotationMark)
        {
            i++; // we're more interested in the next character
            if (i >= line.Length)
            {
                // quotation mark was closing cell;
                // return iterator after end of string
                break;
            }
            if (line[i] == m_config.Delimiter)
            {
                // quotation mark was closing cell;
                // return iterator after delimiter
                i++;
                break;
            }
            if (line[i] == m_config.QuotationMark)
            {
                // it was doubled (escaped) quotation mark;
                // do nothing -- we've already skipped first quotation mark
            }

        }
        sb.Append(line[i]);
        i++;
    }

    return sb.ToString();
}

Every escaped cell starts with a quote, so first of all, we should omit this character.

Interesting things happen when we hit on a quotation mark again. We immediately jump to the next character (by incrementing iterator). Basically, two cases exist:

If it’s delimiter or end of a text, we just reached the end of a cell – we break the loop.
Nonetheless, if it’s quotation mark again, we have doubled quote. So user expects that there will be a single quote in a result cell. We don’t break the loop and save one quote instead.

And that’s all. We have fully functional CsvReader. Usage is simple:

var reader = new CsvReader();
string csv = File.ReadAllText("file.csv");
foreach (var row in reader.Read(csv))
{
    // so something with the whole row

    foreach (var cell in row)
    {
        // do something with the cell
    }
}

CSV internationalization problems

As I mentioned before, Excel compatibility requirement caused some serious problems. Our CSV files had to be opened with a double-click (without importing). And, as it turned out, Excel isn’t as good being international. For example:

Despite the fact that CSV stands for “comma-separated values”, Excel doesn’t always use a comma for separating values. It depends on your system’s “Regional and Language Options”. As we’re located in Poland, our Excels use… semicolon! And it’s not even related to the system language – I have Windows in English, but CSVs with semicolons.
Excel always saves CSV files with ANSI encoding, which means… you never know what. In Western Europe it would be Windows-1252, in Poland: Windows-1250.
Excel doesn’t even try to recognize file encoding, it always assumes ANSI. So if you open a UTF-8 file, all you get is mojibake.

I found no solution to solve these problems. We assumed that our CSV files will be opened in Poland most often. Therefore we used a semicolon as a separator and Windows-1250 as an encoding. As a result, opening them in other parts of the world would require import.

But it isn’t the end of bad news. We use CSV files with texts in different languages, so we cannot always use Windows-1250. It just doesn’t support the full set of other languages’ characters. As you see, we’re up the creek – we have to use Windows-1250, but we cannot.

Consequently, we end up with most of our files being encoded in Windows-1250 and some of them in different encodings. As I stated earlier, Excel doesn’t get by such situation, but I made our tools be able to do so. Every file has a special header containing information about used encoding, delimiter etc. (basically it’s our Config class, somehow serialized and saved in the first line). Thanks to that we’re able to parse CSVs without errors – at least in our scripts. In Excel, it isn’t always possible.

Summary

When I was getting down to this task, it looked like something quite easy. CSV format is very simple after all, so what could possibly go wrong? Well, everything.

First of all, Google had no solution for me. Only simple scripts I found didn’t support comma-in-cell (which was unacceptable). Besides, there were some big libraries, which not only handle CSV files but also do many other things. And most of these things were completely useless for us.

So I have written my own reader and writer, addressing some of the problems, but not all. Full Excel compatibility with opening files by double-click was accomplished only partially. Nevertheless, I hope that provided code snippets and described experiences will be useful for you.

By the way, it is the very first article on our blog. How do you like it? Let us know in the comments!

Uncategorized

Piotr Korgul View All →

8 Comments Leave a comment ›

M.Elnaqeeb
September 19, 2017 at 2:11 am Reply

this should become first result from Google for csv c#

LikeLike
Srujan
May 9, 2019 at 1:36 am Reply

Man!! You are a genius. Elnaqeeb is right, this should come as first result in search.
Thank you so much, you saved my day. Otherwise I would have spent weeks writing this.

LikeLike
David Brown
February 7, 2021 at 5:00 pm Reply

The CSV desert now has an oasis. I spent an entire morning looking for this. Thank you so much for posting.

LikeLike
Steve Ryckman
December 17, 2024 at 7:17 pm Reply

Found an issue where if a line in the import file has something like: His name was “bob” that the reader drops the first quote from in front of bob and if a quote is at the end of a field, it attaches the next field.

Sample line: “TestLine”,”Field2″,”His name was “Bob””,”Field After”

Becomes: His name was Bob”,Field After

LikeLike
- growlofowl
  December 18, 2024 at 7:51 am Reply
  
  Hi Steve! Thank you for posting. As far as I know double quotation marks are used to escape a single quotation mark in CSV and it seems that this is the case in yout example, i.e. the correctly formed CSV should either be:
  “His name was “”Bob”””
  or
  His name was “Bob”
  
  LikeLike
  - Steve Ryckman
    December 18, 2024 at 9:42 pm
    
    Correctly formed CSV being the key – which I find that many vendors don’t do. In three different CSV files I was working with today, Keepass, LastPass and DashLane all did it differently. I don’t remember which was which but these are the variations:
    
    “His name was “Bob”” while another did it as “His name was “””Bob”””” and yet another did it in the format “His name was “”Bob””” besides the original version I saw of “His name was “Bob”” with no extra quotes.
    
    LikeLike
Musketir
July 5, 2026 at 9:23 am Reply

what is te purpose of this?

if(line[i] == m_config.QuotationMark)

            {

                // it was doubled (escaped) quotation mark;

                // do nothing -- we've already skipped first quotation mark

            }

You test against something and then disregard the result. Lucky for you the compiler is smart enought ot remove it from exe.

LikeLike
- growlofowl
  July 6, 2026 at 8:40 am Reply
  
  Thanks for the comment! Pls note that there’s i++ instruction earlier, so it references a different character.
  
  LikeLike