PGTS Humble Blog
Thread: Tips/Tricks For Programming etc

Gerry Patterson. The world's most humble blogger
Edited and endorsed by PGTS, Home of the world's most humble blogger

CR/LF LF Linefeeds Again And Again
Chronological Blog Entries:

Date: Tue, 25 Aug 2009 00:31:43 +1000

One task that never seems to go away is the old Windows CR/LF conversion problem. The apocryphal tale about this is that it was due to a basic misunderstanding of the difference between DCE and DTE back in the days of CP/M ... Whatever, this little annoyance will probably be with us forever. Here are some simple tips that help deal with it.
There are a number of commands that will do this:
-
dos2unix/unix2dos: These are available on Linux, FreeBSD, cygwin and many other "nix"s. They are also known by the names fromdos and todos. In fact they are all just logical links to the same program. These commands do more or less what the names imply. Use "man" to read about the extra options. This is the easiest and quickest way to convert files:
unix2dos file.txt # convert file.txt unix to dos (over-write original)
dos2unix -a /foo/bar/*.txt # remove all "\r" from the specified files
-
Other unix commands: Not all systems have the handy tofrodos commands, but almost all of them have sed and/or tr, and most of them have awk and perl:
awk '{print $0"\r"}' unix.txt > dos.txt # unix to dos: append "\r" to every line
sed -e 's/$/\r/' unix.txt > dos.txt # same (assumes a sed that understands "\r", such as GNU sed)
perl -pe 's/\n/\r\n/' unix.txt > dos.txt # same
tr -d "\r" < dos.txt > unix.txt # dos to unix: delete every "\r"
cat dos.txt | tr -d "\r" > unix.txt # same as previous
perl -pe 's/\r//g' dos.txt > unix.txt # this works like "dos2unix -a"
sed -e 's/\r//g' dos.txt > unix.txt # same as previous
perl -pe 's/\r//' dos.txt > unix.txt # removes the first occurrence of "\r" on each line
sed -e 's/\r//' dos.txt > unix.txt # same as previous
sed -e 's/\r$//' dos.txt > unix.txt # this works like dos2unix (without -a)
-
Also, Gvim, vim and vi can eliminate carriage returns (and other control characters) with the substitute command. This is handy if the file is malformed, e.g. if most of the lines end with CR/LF but a few don't. In such cases Gvim (or vim) will load the file as a regular (unix) file and display all the carriage returns as the control character ^M. Gvim and vim, depending on how they are configured, usually show control characters in a different colour. Stock standard vi doesn't have colours or a dos fileformat, and will always display carriage returns as ^M. You can get rid of the carriage returns (and other control characters) by using the Ctrl-V quote feature in vi or vim. The command will appear as follows:
:%s/^M//
Where the ^M character is created with the two keyboard presses Ctrl-V Ctrl-M. For more details about the Ctrl-V quote command see the note below.
-
Windows commands: However, you may be stuck in Windows, without cygwin. If so, your best option is to install ActiveState perl. Failing that, the type command (piped through find) will cope with files terminated with a single "\n":
perl -npe "" unix.txt > dos.txt
perl -ne "print" unix.txt > dos.txt
type unix.txt | find /V "" > dos.txt
The standard Activestate perl distribution comes configured for Windows.
So the CR/LF is done automatically when you write to STDOUT. That's why the
command with an empty expression (above) will work in Windows. NB: to stop
this behaviour, include this line in your perl scripts:
binmode(STDOUT);
Alternatively, the program Gvim for Windows is very powerful and capable of loading fairly large files (depending on how much memory your workstation has). In this case files can be converted by using the :set fileformat=dos command (or :set fileformat=unix to go the other way) and then writing the file.
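The difference between stripping only a trailing "\r" and stripping every "\r" (the two dos2unix behaviours described above) is easy to check on a small malformed sample. The following is only a sketch: the function names are made up for illustration, and a literal carriage return is built with printf so that it does not depend on sed understanding "\r".

```shell
#!/bin/sh
# All helper names below are hypothetical, chosen for this sketch only.

cr=$(printf '\r')   # a literal carriage return character

# Strip only a CR at the end of each line (like dos2unix without -a).
strip_trailing_cr() { sed "s/${cr}\$//"; }

# Strip every CR, wherever it appears (like "dos2unix -a").
strip_all_cr() { tr -d '\r'; }

# Append a CR to every line (unix to dos).
add_cr() { awk '{ printf "%s\r\n", $0 }'; }

# Count the lines that contain at least one CR.
# grep exits non-zero when the count is 0, hence the "|| true".
count_cr_lines() { grep -c "$cr" || true; }

# Malformed sample: line 1 is DOS, line 2 is Unix,
# line 3 is DOS and also has a CR in the middle.
sample() { printf 'one\r\ntwo\nthr\ree\r\n'; }

sample | count_cr_lines                       # prints 2
sample | strip_trailing_cr | count_cr_lines   # prints 1 (the embedded CR survives)
sample | strip_all_cr | count_cr_lines        # prints 0
```

Piping the output of any of these through od -c is a quick way to see exactly which bytes are left.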
Note: As mentioned above vi (and vim) can replace control characters using the substitute command combined with the Ctrl-V quote technique. This is simply a matter of pressing Ctrl-V followed by the control character combination that you wish to quote. The (ASCII) control character combinations are as follows:
Key | Hex | Code | Description |
^A | 0x01 | SOH | Start Of Heading |
^B | 0x02 | STX | Start Of Text |
^C | 0x03 | ETX | End Of Text |
^D | 0x04 | EOT | End Of Transmission |
^E | 0x05 | ENQ | Enquiry |
^F | 0x06 | ACK | Acknowledge |
^G | 0x07 | BEL | Bell |
^H | 0x08 | BS | Backspace |
^I | 0x09 | TAB | Tab |
^J | 0x00 | NUL | Null |
^K | 0x0B | VT | Vertical Tab |
^L | 0x0C | FF | Form feed |
^M | 0x0D | CR | Carriage return |
^N | 0x0E | SO | Shift Out |
^O | 0x0F | SI | Shift In |
^P | 0x10 | DLE | Data Link Escape |
^R | 0x12 | DC2 | Device Control 2 |
^T | 0x14 | DC4 | Device Control 4 |
^U | 0x15 | NAK | Negative Acknowledge |
^V | 0x16 | SYN | Synchronous Idle |
^W | 0x17 | ETB | End Of Transmission Block |
^X | 0x18 | CAN | Cancel |
^Y | 0x19 | EM | End Of Medium |
^Z | 0x1A | SUB | Substitute |
^[ | 0x1B | ESC | Escape |
^\ | 0x1C | FS | File Separator |
^] | 0x1D | GS | Group Separator |
^^ | 0x1E | RS | Record Separator |
^_ | 0x1F | US | Unit Separator |
This all sorta works ... The control characters are mapped to the various letters of the alphabet starting with the letter A. Those of you who actually remember using a teleprinter terminal will appreciate that this mapping makes a weird mnemonic type of "sense" (if you know your Hex). And the missing characters make sense also. Ctrl-Q and Ctrl-S traditionally control output to the screen (XON/XOFF flow control) and don't appear in the table above. And of course in "vi" you can't "quote" Ctrl-J since it maps to the record terminator ("\n"), and therefore Ctrl-J becomes the "null" character in the table, even though in plain ASCII it is 0x0A (LF, line feed). BTW if you press Ctrl-J while using the console it will map to the line-feed terminator. In fact all the control characters map to the corresponding non-control characters masked with 0x1F - e.g. Ctrl-A is 0x41 & 0x1F = 0x01, Ctrl-B is 0x42 & 0x1F = 0x02, ... etc.
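The 0x1F masking can be tried out from the shell. This is a small sketch, assuming a POSIX-style printf that accepts the leading-quote form to get a character's numeric code; ctrl_code is a made-up name used only for illustration.

```shell
#!/bin/sh
# ctrl_code is a hypothetical helper: given a key such as A or M,
# print the hex code of the matching control character.
ctrl_code() {
    # printf with a leading quote yields the numeric code of the character
    code=$(printf '%d' "'$1")
    # mask with 0x1F to get the control code
    printf '0x%02X\n' $(( code & 0x1F ))
}

for key in A B I M '['; do
    printf 'Ctrl-%s = %s\n' "$key" "$(ctrl_code "$key")"
done
# Ctrl-A = 0x01
# Ctrl-B = 0x02
# Ctrl-I = 0x09
# Ctrl-M = 0x0D
# Ctrl-[ = 0x1B
```

Lower-case letters give the same answers, since 0x61 & 0x1F is also 0x01.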