Why is there no decent file format for tabular data?

Tabular data is everywhere. I support reading and writing tabular data in various formats in all three of my software applications. It’s an important part of my data transformation software. But every tabular data format sucks. There doesn’t seem to be anything that is reasonably space efficient, simple and quick to parse, and text-based (not binary) so you can view and edit it with a standard editor.

Most tabular data is currently exchanged as: CSV, tab separated, XML, JSON or Excel. And they’re all very suboptimal for the job.

CSV is a mess. One quote in the wrong place and the file is invalid. It’s hard to parse efficiently using multiple cores, because of the quoting (you can’t start parsing from the middle of a file). Different escaping schemes are in use. There is no way to know what encoding the file is in. The use of separators and line endings is inconsistent (sometimes comma, sometimes semicolon). Writing a parser that handles all the different dialects is not at all trivial. Microsoft Excel and Apple Numbers don’t even agree on how to interpret some edge cases for CSV.

Tab separated is a bit better than CSV. But it can’t store tabs and still has issues with line endings, encodings, etc.

XML and JSON are tree structures and not suitable for storing tabular data efficiently (along with other issues).

Then there is Parquet. It is very efficient with its columnar storage and compression. But it’s binary, so it can’t be viewed or edited with standard tools, which is a pain.

Don’t even get me started on Excel’s proprietary and horrible binary format.

Why can’t we have a format where:

  • Encoding is always UTF-8
  • Values are stored in row-major order (row 1, row 2, etc.)
  • Columns are separated by u001F (ASCII unit separator)
  • Lines are separated by u001E (ASCII record separator)
  • Uh, that’s the whole spec.

No escaping. If you want to put u001F or u001E in your data: tough luck, use a different format.

It would be reasonably compact, efficient to parse, and easy to edit manually (Notepad++ displays the unit separator as a “US” symbol). You could write a quick parser for it in minutes. Typing u001F or u001E in some editors can be a hassle, but it’s not a showstopper.
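To illustrate just how small such a parser would be, here is a minimal sketch in Python. It assumes the hypothetical format described above (UTF-8 text, fields separated by U+001F, records separated by U+001E, no escaping); the function name `parse_usv` is my own invention.

```python
def parse_usv(text: str) -> list[list[str]]:
    """Split USV text into rows of fields.

    No escaping rules exist: the two separator characters
    simply cannot appear inside a value.
    """
    records = text.split("\u001e")
    # A trailing record separator leaves one empty trailing record; drop it.
    if records and records[-1] == "":
        records.pop()
    return [record.split("\u001f") for record in records]

data = "name\u001fage\u001eAda\u001f36\u001eAlan\u001f41\u001e"
print(parse_usv(data))  # [['name', 'age'], ['Ada', '36'], ['Alan', '41']]
```

The whole thing is two `split` calls, which is rather the point.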

It might be called something like “unicode separated value” (hat tip to @fakeunicode on Twitter for the name) or “unit separated value”, with the file extension .usv. Perhaps a different extension could be used when values are stored in column-major order (column 1, column 2, etc.).

Does something like this already exist? Maybe it does and I just haven’t heard of it. If not, shouldn’t it?

And yes, I know the relevant XKCD cartoon ( https://xkcd.com/927/ ).

** Edit, 4 May 2022 **

Changed “Javascript” to “JSON” in paragraph 5.

It was pointed out that the above will give you a single line of text in an editor, which is not ideal for human readability. A quick fix for this would be to make the record delimiter a u001E character followed by an LF character. Any LF that comes immediately after a u001E is ignored during parsing; any LF not immediately after a u001E is part of the data. I don’t know about other editors, but this is easy to view and edit in Notepad++.
