CSV as an import format

To this point in my programming career I absolutely LOATH Comma Seperated Value files.

Most formidably because they have never been standardized. Which is really absolute nonsense.

Let me first say why you should even consider CSV import and exporting in the first place.

Good stuff

  • It is very data efficient. It only has one character for separators and nothing else to dilute and bloat your export files.
  • It can be read very quickly, even if it is a couple of a hundred megabytes big, because every line is a row.
  • It is adopted by many application in the market, both office and data management.

Bad stuff

UTF-8

Nothing protects you from using odd character sets as it is free game. It is a prevalent problem in other formats, but especially here because:

BOM characters

Their are special characters you can include inside a CSV that will totally screw over this application and that. Making it react radically differently when you try to use it.

Non stadarization

It doesn’t even hold true to it’s name, as most CSV’s I have seen use semicolon ; instead of the comma ,

With this it really makes it hard to standardize the token that separates columns. But even with that, some find it necessary to not escape a column, but just ‘skip it’ instead.

Null values appearing as an empty space between 2 semicolons ;;
Or blatantly NULL instead of others being cordoned off with double quotes “”

Which is also not mandatory, which you have to take in to account also when you write an importer.

To top it all off; most do not even have the courtesy to actually ESCAPE double quotes used within the columns sometimes. Which makes it impossible to see where a column ends.

Newline

Which is also stupidly done in my opinion. A lot of people know that Windows uses carriage return /r before the newline /n, which doesn’t make a lot of sense. It makes the least sense in CSV because a newline is used to signify a new row.

Why use 2 characters? Why allow this overhead?

Just use /n and be done with it.

Nail and coffin

If not for it’s efficiency and wide adoption I would wish this format GONE. Away from the earth.

If this standard was EVER to be properly standardized, you could auto generate an importer for nearly every programming language and easily go from any database or application to another. It’s no rocket science, but it becomes that, because it is so rubbishly used all over.

My recommendation

Force UTF-8

Or skip that and use UTF-16 or UTF-32 immediately.

But make it consistent!

All use semicolon

; Is less used then the comma. Thus leaves less room for error.

Omit ALL values with double quote

Is a no-brainer. It doesn’t matter what the actual value means, just put it around that so it is consistent. It’s all text anyway, why bother trying to decide if it should go between quotes or not.

If we can come to common grounds that a values might actually be NULL. Then we could consider that being the only exception to the rule.

Sanitation for double quote

Escape any double quotes that are part of your values. ALWAYS.

No screwing around

Header names might go on the top, without semicolons for ommition and should not contain the semicolon.
They are column or property names, they should not need those.

So the first character of your file should either be a letter (for a header) or a double quote (if it only contains values).

Final thought

You might consider stating the obvious a waste of breath (or finger movement), but I want to start somewhere. This has to change at some point and hopefully for the better. Their are a lot of promising standards out their like XML and JSON that all work very well also. But for very large chunks of data, a lot of people will keep preferring CSV.

So until we find something better, we all want to consider abolishing the things that make some pull their hair out and eat it at 10 O’clock in the evening after hours of fiddling with something that should be trivial.