Feeding the Line: Carriage Returns Matter

It all started as a request as easy as any other requests: please take this file and use it to upload data to the database.

Sure, no problem. First, open the Excel sheet ... clean it up a bit ... save as a CSV ... and then I have it in the form that this handy little Perl script can upload to the database.

But when I run the script, which is handling zip codes for context, all the zip codes are 00000. Not only is this not a valid zip code, even if it were the script basically would have taken several hundred unique zip codes and their location data and made them all into 00000. 🤔

The first thing that I notice on opening the file is that all the rows are actually on a single line and are separated by ^M in vim.

So what's ^M and how do I fix it?

This is where we circle back around to our title about carriage returns and line feeds. A little relevant info bites to help us get started:

  • A carriage return (CR) means that the cursor moves to the beginning of the line, this is denoted by \r.
  • A line feed (LF) means moving to the next line, this is denoted by \n.

And a few more bits of information:

  • Windows uses \r\n as its End of Line sequence.
  • Mac uses \r as its EOL sequence.
  • Unix uess \n as its EOL sequence.

You can see the trouble brewing now, can't you? 🔮

So when I used Excel on a Mac to convert the file to a CSV file, it was using \r as the newline character. When I uploaded that to the server of interest, running CentOS, it showed it as being one long line because I was now in Linux not macOS. So when I supplied the CSV file to the Perl script, a script that wasn't written to handle all three scenarios, it nope'd right outta there.

But that leads us to our purpose:

  1. How do I fix this?
  2. Why ^M?

For the first, in vim I can do a global replacement on the new line mish-mash with:

:%s/\r/\r/g

Breaking this down \r in vim is finding any existing new line character, in this case all the ^Ms, and replacing them with the OS's new line character, in this case \n. That solves our new line issue. %s applies this change to all lines of the file and g applies the change to all instances on each line. Without the g only the first instance of ^M will be changed per line. (In this single line file, that means it'll only be replaced once.)

Now for the ^M. If you pull up an ASCII chart, you'll find that the line feed character, \n, is 0xA (or 0x0A); whereas the carriage return character, \r, is 0xD (or 0x0D). The reason vim displayed the carriage returns as ^M is because D is 13 in hexidecimal (0-9, then A-F for 10-15) and the 13th letter of the English alphabet is ... 🥁 ... M.

But that's not your only problem

After feeling all happy that I fixed my new lines, I found another problem. Because there's always more than one 😉

When the file was being read in by Perl every line was prefixed with: \x{feff}. This is a zero width no break space.

Invisicharacters are the bane of my existance today, it seems. A quick way to fix this is:

perl -CSD -pe 's/^\x{feff}//' ${FILENAME}.csv

Since I had a few files with this problem I just wrapped this into a single line for loop in BASH:

for FILE in *csv; do TMPZIP=$(mktemp zip-XXXXX) && perl -CSD -pe 's/^\x{feff}//' ${FILE} > ${TMPZIP} && cp ${TMPZIP} ${FILE} && rm ${TMPZIP}; done

A quick explanation of the script:

  • FILE in *csv is a foreach, so the loop will perform this action on all CSV files with the .csv extension
  • I used mktemp to make a temp file to prevent the unlikely event where me mashing up a common temporary file name like tmp will overwrite a tmp file that I actually needed / wanted. See man mktemp for more about the command.
  • the perl ... line is reading in the file, replacing the feff hex character with no character (removing it), and writing the output to a new file. This is because the perl line doesn't modify the file itself, it just prints it to screen (stdout).
  • I copy the TMPZIP file to overwrite the existing FILE.
  • I remove the TMPZIP file.

Note that if you mktemp zip-XXXXX.csv you'll be making a CSV file, which will be pulled into your for loop and will wreak some havoc on what you were hoping would be an easy, clean fix. To see how this works, create some faux CSV files or just some backups of real files, and run the for loop with a temp file that has the CSV extension.