CSV as an import format

At this point in my programming career I absolutely LOATHE Comma Separated Value files.

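Mostly because the format has never been properly standardized (RFC 4180 describes a common form, but it is only informational and many tools deviate from it). Which is really absolute nonsense.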

Let me first say why you should even consider CSV import and export in the first place.

Good stuff

  • It is very data efficient. It only needs one separator character between fields and has nothing else to dilute and bloat your export files.
  • It can be read very quickly, even if it is a couple of hundred megabytes big, because every line is a row.
  • It is supported by many applications on the market, both office and data management.

Bad stuff

UTF-8

Nothing protects you from odd character sets; it is all fair game. This is a common problem in other formats too, but especially here, because a CSV file has nowhere to declare its encoding.

BOM characters

There are special characters, such as the byte order mark (BOM), that you can include inside a CSV and that will totally screw over this application or that one, making it react radically differently when you try to use the file.
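
To make the BOM problem a bit more concrete, here is a minimal sketch in Python. It assumes the file is UTF-8 and semicolon separated; the function name read_rows and the sample data are made up for illustration. The "utf-8-sig" codec drops a leading BOM if there is one, so both inputs parse the same way.

```python
import codecs
import csv
import io

def read_rows(raw, delimiter=";"):
    # "utf-8-sig" decodes UTF-8 and silently drops a leading BOM if present
    text = raw.decode("utf-8-sig")
    # newline="" leaves line endings alone so the csv module can handle them
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))

# Hypothetical sample data: the same file, once with and once without a BOM
with_bom = codecs.BOM_UTF8 + b"name;city\r\nAnna;Oslo\r\n"
without_bom = b"name;city\r\nAnna;Oslo\r\n"

rows_with_bom = read_rows(with_bom)
rows_without_bom = read_rows(without_bom)
```

Decoding with plain "utf-8" instead would leave the BOM glued to the first header name, which is exactly the kind of thing that makes a column lookup fail for no visible reason.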

Non-standardization

It doesn’t even hold true to its name, as most CSVs I have seen use the semicolon ; instead of the comma ,

That alone makes it hard to standardize the token that separates columns. But even with that settled, some find it necessary not to escape a column, but to just ‘skip it’ instead.

Null values appear as an empty space between 2 semicolons ;;
Or blatantly as NULL, while other values are cordoned off with double quotes “”

Those quotes are not mandatory either, which you also have to take into account when you write an importer.

To top it all off, most do not even have the courtesy to actually ESCAPE double quotes used within the columns, which makes it impossible to see where a column ends.
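
As an illustration of what an importer ends up doing, here is a rough Python sketch. Everything in it is an assumption on my part: the helper names, the naive delimiter guess based on the first line, and the policy that both an empty field and a literal NULL mean "no value". It is not a general solution, and it cannot repair unescaped double quotes.

```python
import csv
import io

# Assumption: these spellings all mean "no value" in the incoming file
NULL_MARKERS = {"", "NULL"}

def guess_delimiter(first_line, candidates=";,\t"):
    # Naive heuristic: pick the candidate that occurs most often in the first line
    return max(candidates, key=first_line.count)

def read_lenient(text):
    # Expects at least one line of input; an empty string raises IndexError
    delimiter = guess_delimiter(text.splitlines()[0])
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    # Map every null spelling to None, keep all other fields as text
    return [[None if field in NULL_MARKERS else field for field in row]
            for row in reader]

# Hypothetical input mixing the three null styles described above
sample = 'id;name;nickname\n1;Anna;\n2;Bob;NULL\n3;"Cleo";""\n'
rows = read_lenient(sample)
```

Note that this treats a quoted empty string and a missing value as the same thing, which may or may not be what the exporting side intended. That guessing is the whole problem.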

Newline

Which is also stupidly done in my opinion. A lot of people know that Windows uses a carriage return \r before the newline \n, which doesn’t make a lot of sense. It makes the least sense in CSV, because a newline is used to signify a new row.

Why use 2 characters? Why allow this overhead?

Just use \n and be done with it.
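
For what it is worth, Python's csv writer ends rows with \r\n unless told otherwise. A small sketch of writing with a bare \n while still reading both styles could look like this (the function names and the sample rows are mine, purely for illustration):

```python
import csv
import io

def write_lf(rows):
    # newline="" stops StringIO from translating line endings on its own
    buffer = io.StringIO(newline="")
    # lineterminator overrides the csv module's default of "\r\n"
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

def read_any(text):
    # The csv reader accepts both "\n" and "\r\n" as row endings
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=";"))

rows = [["name", "city"], ["Anna", "Oslo"]]
lf_text = write_lf(rows)
crlf_text = lf_text.replace("\n", "\r\n")  # simulate a Windows-style export
```

So being strict about what you write and forgiving about what you read costs next to nothing here.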

Nail and coffin

If not for its efficiency and wide adoption I would wish this format GONE. Away from the earth.

If this format was EVER to be properly standardized, you could auto-generate an importer for nearly every programming language and easily go from any database or application to another. It’s not rocket science, but it becomes that, because the format is used so sloppily all over.

My recommendation

Force UTF-8

Or skip that and use UTF-16 or UTF-32 immediately.

But make it consistent!

All use semicolon

; is used less than the comma inside actual values, and thus leaves less room for error.

Enclose ALL values in double quotes

It’s a no-brainer. It doesn’t matter what the actual value means, just put quotes around it so it is consistent. It’s all text anyway, so why bother trying to decide whether it should go between quotes or not?

If we can agree that a value might actually be NULL, then we could consider that the only exception to the rule.

Sanitizing double quotes

Escape any double quotes that are part of your values. ALWAYS.

No screwing around

Header names may go at the top, without enclosing double quotes, and should not contain the semicolon.
They are column or property names, so they should not need either.

So the first character of your file should either be a letter (for a header) or a double quote (if it only contains values).
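
Put together, the recommendation could be sketched in Python roughly like this. Several details are my own assumptions, not part of any standard: double quotes are escaped by doubling them, a NULL is written as an empty unquoted field, rows end in \n, and the function names are invented for the example.

```python
import csv
import io

def quote(value):
    # Assumption: None (NULL) is the one exception and becomes an empty, unquoted field
    if value is None:
        return ""
    # Assumption: a double quote inside a value is escaped by doubling it
    return '"' + str(value).replace('"', '""') + '"'

def to_csv(header, rows, delimiter=";"):
    # Header names stay unquoted, so they may not contain the delimiter or quotes
    for name in header:
        if delimiter in name or '"' in name:
            raise ValueError(f"invalid header name: {name!r}")
    lines = [delimiter.join(header)]
    lines += [delimiter.join(quote(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"

text = to_csv(
    ["id", "quote", "note"],
    [[1, 'She said "hi"', None], [2, "a;b", "ok"]],
)
data = text.encode("utf-8")  # force UTF-8 on the way out

# Reading it back with the standard csv module as a sanity check
parsed = list(csv.reader(io.StringIO(text, newline=""), delimiter=";"))
```

The round trip through the csv reader loses the difference between NULL and an empty string, since both come back as an empty text field. An importer that cares would have to look at the raw field before unquoting.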

Final thought

You might consider stating the obvious a waste of breath (or finger movement), but I want to start somewhere. This has to change at some point, and hopefully for the better. There are a lot of promising standards out there, like XML and JSON, that all work very well too. But for very large chunks of data, a lot of people will keep preferring CSV.

So until we find something better, we should all consider abolishing the things that make some of us pull our hair out and eat it at 10 o’clock in the evening, after hours of fiddling with something that should be trivial.

Etherpad lite

Intro

If you are interested in running your own multi user collaboration text editor from a browser, then Etherpad might be the thing for you.

To be more precise: Etherpad lite
http://etherpad.org/

Years ago I used to role play on MUCK servers that had this kind of functionality: editing text together, although there the focus was more role play oriented. That led me to find Inifiny note, which later got renamed to Gobby.

Gobby does the same as Etherpad does, but runs from a dedicated client instead of a browser.

Background info

Currently I’m in the process of implementing Etherpad Lite for a non-profit organization I do technical tidbits for.
As such this post will grow as my understanding of the application, its installation and customizations grows.

First of all, if you want to give Etherpad a try without installing it, I suggest trying either:

Piratepad
http://piratepad.net/front-page/

Which is run by the Swedish Pirate Party and is free, but only runs on HTTP.

Titanpad
https://titanpad.com/

Which is run by a non-profit organization that keeps it going on donations. It uses a secure HTTPS connection by default.

Both need a little disclaimer: I write this on 25-06-2015 and things might change over time. They might even disappear (which would be a shame, but is entirely possible).

First impressions

The trial installation the non-profit organization gave me to try out is pretty good. Etherpad is quick and does what is advertised. When opened in Firefox it springs up within a second and you are up and editing immediately, without a hassle.

The interface is mostly intuitive. It has all the bits you expect from a simple editor, and you see others adding text in real time, each showing up in their respective color.

Installation instructions, settings and plugins look pretty good too. However, I ran into problems as soon as I tried installing it on my Raspberry Pi (which is a Model B+, for those who want to know).

Installing stuff with apt-get is no trouble at all, and neither is pulling the application from git. However, when you start it, you first need to get past the fact that you are running it as root.
I do not understand this mechanism, but it is probably there to discourage running the application under the root user.
Then, after a lengthy startup, it starts spewing all kinds of errors which do not make a lot of sense to me.
I must be doing something wrong, but I do not understand what yet. I have retried several times, but I have given up for the time being. I might retry on my full-fledged server, which is an Intel Core Duo and might have an easier time running it.

There, this is the first part. I hope to add more soon.

Postfix, DKIM and random failed verifications

We use the ever so popular PHP to send outgoing mail through Postfix and sign it with a DKIM signature (opendkim, DKIMProxy or any other implementation). Most of the time this simple setup seemed to work just fine.
However, as our `application` grew, some emails started to bounce or were otherwise rejected with a DKIM (body) verification error (Gmail, Hotmail etc. all use DKIM verification nowadays, which is a good thing).

It took a solid hour to start figuring out what was going on, as it isn’t easy to see what broke the signature. A big clue comes from the fact that most emails verified just fine and most services specify which part fails.

The main culprit turned out to be an ancient limit on the maximum length of one line (lines being separated by \r\n, sometimes denoted as <CR><LF>), as described in the Postfix manual under smtp_line_length_limit.

Normally this should not be a problem (one expects the DKIM filter/milter to be the final pass), except that Postfix chooses to enforce this limit after pickup or the (non_)smtpd_milters have run, so an overlong line gets broken up after the body has already been signed.
This problem is very easy to trigger: just insert minified JS or CSS into an email.

Simply adding the following line to your main.cf fixes this problem. So far no mail servers have mysteriously begun hating me.

smtp_line_length_limit = 250000
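
If you would rather catch such mails before they reach Postfix, a small check like the following Python sketch can help. It is not part of Postfix, opendkim or our PHP code; the function name and the sample message are made up, and the limit of 998 characters (excluding the trailing \r\n) is my assumption, taken from the line length rule in RFC 5322. The second half shows one possible alternative to raising the limit: letting the mail library encode the body as base64, which wraps it into short lines.

```python
from email import policy
from email.message import EmailMessage

MAX_LINE = 998  # assumption: RFC 5322 limit, not counting the trailing CRLF

def long_lines(message, limit=MAX_LINE):
    """Return (line_number, length) for every line longer than limit."""
    offenders = []
    for number, line in enumerate(message.split(b"\r\n"), start=1):
        if len(line) > limit:
            offenders.append((number, len(line)))
    return offenders

# A raw body with one overlong line, like minified CSS pasted into a mail
raw = b"short line\r\n" + b"x" * 1200 + b"\r\n"
raw_offenders = long_lines(raw)

# Alternative sketch: base64 transfer encoding keeps every line short
msg = EmailMessage()
msg["Subject"] = "Minified CSS inside"
html = "<style>" + "a{b:c}" * 500 + "</style>"
msg.set_content(html, subtype="html", cte="base64")
wire = msg.as_bytes(policy=policy.SMTP)  # serialize with \r\n line endings
wire_offenders = long_lines(wire)
```

Whether to raise the limit or encode the body is a judgment call. Raising the limit keeps the mail source readable, while encoding stays within what other mail servers expect to receive.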

Something new

I am not much of a blogger in the sense that I post about my everyday things as if I have the need to tell the world at large about every mundane thing I do.

This blog is about technical solutions that are either: good to know, really smart, took a long time to figure out and/or are really obscure.

As such I might also go into ramblings about life as a technically minded person and a programmer.
I might even move some of my stories here, as I am looking for an outlet to place them in.

That said I really enjoy this blog post and find it to be very true:
http://www.stilldrinking.org/programming-sucks

I wish you a nice time and hope to share many war stories from the field with you.