looking for a tool to convert line endings

IGarcia
Posts: 543
Joined: Mon Jul 05, 2010 10:27 pm

Re: looking for a tool to convert line endings

Post by IGarcia »

Sven Schüle wrote:
rvida wrote:I decided to use the sed based solution. It is a wonderful tool,
Hi Richard,

take care of doing both steps (CR/LF->LF conversion + trailing whitespace removal) separately, starting with the CR/LF part. Depending on the platform where you perform these conversions, "sed" as well as "perl" or other tools using regular expressions may or may not recognize a CR/LF character sequence as something that matches a "$" (end of input line) in the given pattern. Therefore a pattern logically resembling "<whitespace><whitespace>*$" may or may not match an input line that ends with <whitespace><CR><LF>. You can expect it to succeed in a typical Windows-like environment where CR/LF is the typical text file line ending, but not in a typical UNIX environment. Furthermore, also combining "<whitespace><whitespace>*<CR><LF>" in one pattern will not always succeed since line endings could be inconsistent within one file.

The idea behind $ matching the end of a line is portability: it matches the end of line at run time, independent of the OS. So a program that deals with some data before an end of line, (data)$, will always find the data even if you run the script on a different operating system.

The main problem arises when you write a regular expression using $ (end-of-line match) when you really want to match one specific character (CR or LF). Then the code will probably fail on another OS.

In this case, the problem posed by Vida, you are looking for a specific combination of spaces, tabs, CR and LF. Here it's OK not to use $.

Your post, which is valid and makes an important point about these details, makes me think the command I posted will not work if the input has mixed line endings (some lines with CR+LF, others with only LF). This is solved by making the CR match optional. So version 3 ( :evil: ) is now:

Code:

perl -pe 's/\s*\r*\n/\n/' in > out
This code will not work for old Mac text input, because it still requires at least one \n. The most accurate solution is the Perl script posted by Horacio.
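
For input that may mix all three conventions, including old Mac files that use a lone CR, a minimal sketch with standard POSIX tools could look like this. The function name normalize_eol is made up for illustration, and the sketch assumes an awk that understands \r in regular expressions (the common implementations do).

```shell
# normalize_eol (hypothetical name): read text on stdin, write it to stdout
# with every line ending turned into LF and trailing whitespace removed.
#
# awk splits records on LF. A CR left at the very end of a record belongs to
# a CR+LF pair (or terminates an old Mac file), so it is dropped. Any CR
# still inside the record is an old-Mac line break and becomes an LF.
# sed then trims trailing whitespace from each resulting line.
normalize_eol() {
    awk '{ sub(/\r$/, ""); gsub(/\r/, "\n"); print }' | sed 's/[[:space:]]*$//'
}
```

Usage would be something like `normalize_eol < in > out`. Note that the output always ends with an LF, even if the input's last line had no terminator.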

Sven wrote: Hmmm ... wasn't it an invention from the DOS world to have that CR/LF line ending that created one of the biggest (in)compatibility issues in the whole IT world? :-)

Sven
For a real printer the DOS solution is the more logical one and gives more control, because returning the carriage to column 0 and feeding the paper are separate, optional operations. This lets you overwrite a line by returning the carriage without feeding. That is nonsense on a real printer, because it overprints everything, but useful if you write to a screen and don't want to scroll.


Ignacio
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: looking for a tool to convert line endings

Post by bob »

IGarcia wrote:
bob wrote:
rvida wrote:Hi,

I am looking for a simple tool that can

1) convert Windows (CR+LF) line endings to Unix (LF)
2) trim trailing whitespace at the end of each line
Linux has always had dos2unix and unix2dos commands. Is that what you want?
Those programs are not installed by default, and the other problem is not solved: you still have spaces before the end of the line.

@Horacio: Nice coding. :wink:
My command misses tabs. This will do the trick from the command line:

Code:

perl -pe 's/\s*\r\n/\n/' infile > outfile
They are installed by default on my fedora installations...

I use them regularly.
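
Since dos2unix may or may not be present on a given machine, a script can check for it and fall back to something that is always there. This is only a sketch: to_unix is a made-up name, and it assumes that dos2unix acts as a stdin-to-stdout filter when called without file names (worth verifying on your system). The fallback deletes every CR, not only those in front of an LF, and neither branch removes trailing spaces.

```shell
# to_unix (hypothetical name): CR+LF -> LF filter, stdin to stdout.
to_unix() {
    if command -v dos2unix >/dev/null 2>&1; then
        # Assumption: with no file arguments dos2unix filters stdin.
        dos2unix
    else
        # Fallback: drop all CR characters.
        tr -d '\r'
    fi
}
```

It would be called as `to_unix < infile > outfile`.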
nepossiver
Posts: 38
Joined: Wed Sep 03, 2008 4:12 am

Re: looking for a tool to convert line endings

Post by nepossiver »

Sven wrote:Furthermore, also combining "<whitespace><whitespace>*<CR><LF>" in one pattern will not always succeed since line endings could be inconsistent within one file.
Indeed, and that is the reason I use one replace for LF and another for CR: I've come across (or rather created myself, by jumping between OSes) files with mixed newlines, and chomp produced unexpected results on these files.
IGarcia wrote: This code will not work for old Mac text input, because it still requires at least one \n. The most accurate solution is the Perl script posted by Horacio.
Maybe not so accurate, I think I overdid it:

Code:

$_ =~ s/\r//; 
$_ =~ s/\n//; 
are not necessary, as

Code:

$_ =~ s/\s+$//;

will already remove LF and CR. (I learned this from the Perl FAQ, a great resource.) EDIT: By the way, by Mac I mean the old Mac OSes; Mac OS X is Unix (FreeBSD?) based and uses LF.
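
The reason s/\s+$// needs no separate CR step is that CR counts as whitespace. The same effect can be seen with sed on a Unix-like system. This is only a sketch with made-up function names, and it assumes a sed that supports POSIX character classes.

```shell
# Strip only trailing spaces. With CR+LF input the CR sits between the
# spaces and the end of the line, so nothing is removed.
trim_spaces_only() {
    sed 's/ *$//'
}

# Strip any trailing whitespace. CR is a member of [[:space:]], so the
# spaces and the CR are removed in one go.
trim_all_whitespace() {
    sed 's/[[:space:]]*$//'
}
```

Feeding a line such as `printf 'x  \r\n'` through each function and looking at the result with `od -c` shows the difference: the first leaves the spaces and the CR in place, the second leaves only the x and the LF.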
Sven wrote: Hmmm ... wasn't it an invention from the DOS world to have that CR/LF line ending that created one of the biggest (in)compatibility issues in the whole IT world? :-)
Reading the Wikipedia page on newline, one gets the impression that each early OS had its own version of newline. DOS (and other OSes) kept the typewriter logic, but Unix wanted to save disk space. Maybe it is unfair to blame Microsoft for this particular incompatibility.
Last edited by nepossiver on Mon Mar 19, 2012 4:51 am, edited 1 time in total.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: looking for a tool to convert line endings

Post by bob »

The CR/LF IS the end-of-line sentinel (at least in DOS/Windows text files; in Unix and most other systems a single LF, the 0x0a character, marks the end of a line). There are no "lines" in a file at all, just a continuous stream of bytes. The system simply interprets some character (or pair of characters in DOS/Windows) as an end-of-line marker. One has to be careful with "$" because some tools interpret $ as simply the newline (LF) character. Searching for trailing spaces while the CR character is still there will then fail to match.

It is a real headache, to say the least...
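
Since a file is just a stream of bytes, a quick way to see which end-of-line markers it really contains is to count them. A rough sketch (count_eol is a hypothetical name; it assumes an awk that understands \r in regular expressions, and a final line without any terminator is counted as LF here):

```shell
# count_eol (hypothetical name): report how many lines on stdin end in
# CR+LF and how many end in a bare LF.
count_eol() {
    awk '
        /\r$/ { crlf++; next }   # record had a CR right before the LF
              { lf++ }           # record ended with a plain LF
        END   { printf "crlf=%d lf=%d\n", crlf, lf }
    '
}
```

Run as `count_eol < somefile.txt`; if both counters are non-zero, the file has mixed line endings.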
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: looking for a tool to convert line endings

Post by Sven »

nepossiver wrote:
Sven wrote: Hmmm ... wasn't it an invention from the DOS world to have that CR/LF line ending that created one of the biggest (in)compatibility issues in the whole IT world? :-)
Reading the Wikipedia page on newline, one gets the impression that each early OS had its own version of newline. DOS (and other OSes) kept the typewriter logic, but Unix wanted to save disk space. Maybe it is unfair to blame Microsoft for this particular incompatibility.
O.k., Wikipedia writes that MS-DOS adopted CP/M's CR+LF "in order to be compatible" (to what, may I ask?), so you have a point when stating that it was not an MS-DOS invention. But that was in 1981, when UNIX had already existed for more than 10 years, and UNIX was based on Multics.

To avoid any upcoming "CR+LF" war, I'll quit here and simply say that you are right :-)

Sven
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: looking for a tool to convert line endings

Post by Sven »

IGarcia wrote:
Sven Schüle wrote:
rvida wrote:I decided to use the sed based solution. It is a wonderful tool,
Hi Richard,

take care of doing both steps (CR/LF->LF conversion + trailing whitespace removal) separately, starting with the CR/LF part. Depending on the platform where you perform these conversions, "sed" as well as "perl" or other tools using regular expressions may or may not recognize a CR/LF character sequence as something that matches a "$" (end of input line) in the given pattern. Therefore a pattern logically resembling "<whitespace><whitespace>*$" may or may not match an input line that ends with <whitespace><CR><LF>. You can expect it to succeed in a typical Windows-like environment where CR/LF is the typical text file line ending, but not in a typical UNIX environment. Furthermore, also combining "<whitespace><whitespace>*<CR><LF>" in one pattern will not always succeed since line endings could be inconsistent within one file.
The idea behind $ matching the end of a line is portability: it matches the end of line at run time, independent of the OS. So a program that deals with some data before an end of line, (data)$, will always find the data even if you run the script on a different operating system.
Please reread what I wrote:

- With "sed" on Windows, a "$" will match both "<CR><LF>" and "<LF>" line endings. (But surprisingly "sed" will write its output without the <CR> in the standard case, thus failing to preserve the "native" line ending mode of the OS it is running upon.)

- With "sed" on UNIX, a "$" will only match "<LF>" line endings, and will fail to replace anything when using a pattern like " *$" and the input lines have "<CR><LF>" endings.

That is the reason why a "sed" script using "$" is NOT portable for the task of removing trailing whitespace characters: it only works for input files with the "native" line endings. And that is further the reason why a portable solution (one that runs on both UNIX and WinXX platforms without changes) does TWO steps: 1. CRLF=>LF, 2. remove trailing whitespace.

EDIT: Or uses "\r\n", which also works today (but didn't in some older "sed" versions I used).

Perl is different but not really much better for that purpose, in my opinion.
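
The two-step approach can be sketched as a pair of small shell filters. The function names are invented for this example. The CR is put into a shell variable so that the sed script does not depend on sed itself understanding "\r", which, as mentioned, some older versions did not.

```shell
# Step 1: CR+LF -> LF. A literal CR is supplied by the shell, so no \r
# escape is needed inside the sed script.
crlf_to_lf() {
    cr=$(printf '\r')
    sed "s/${cr}\$//"
}

# Step 2: remove trailing spaces and tabs.
trim_trailing_blanks() {
    sed 's/[[:blank:]]*$//'
}

# Both steps, CR/LF conversion first.
clean_lines() {
    crlf_to_lf | trim_trailing_blanks
}
```

A call like `clean_lines < win.txt > unix.txt` then handles files whose lines end in CR+LF, LF, or a mix of both. A lone CR in the middle of a line is left alone.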

Sven
IGarcia
Posts: 543
Joined: Mon Jul 05, 2010 10:27 pm

Re: looking for a tool to convert line endings

Post by IGarcia »

Sorry. I was not trying to correct you, only to explain the idea behind "$" and why it sometimes fails, not only with sed.

Regards.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: looking for a tool to convert line endings

Post by Don »

rvida wrote:Hi,

I am looking for a simple tool that can

1) convert Windows (CR+LF) line endings to Unix (LF)
2) trim trailing whitespace at the end of each line
With your talent I think you would be much happier as a Unix person.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: looking for a tool to convert line endings

Post by lucasart »

rvida wrote:Thanks for all the answers.

I decided to use the sed based solution. It is a wonderful tool, although for people coming from Dos/Windows world the syntax is somewhat obscure...

Btw. after some googling I found a list of very useful sed one liners:
http://sed.sourceforge.net/sed1line.txt
Yes, it's true that regular expressions aren't easy at first. But once you understand how they work, they are so powerful. It's amazing the amount of stuff one can do with grep and sed.
Perl is even more powerful and can do it all, but again one has to learn how to use it. Anyway, whether sed, perl, awk(?), many paths lead to Rome :D
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: looking for a tool to convert line endings

Post by Don »

My suggestion to Richard is to just learn Perl and be done with it. Perl has powerful regular expressions, is a full scripting language, and is about as expressive as any language can be. You can write one-liners on the command line (without opening an editor) with the -e switch, and they can do amazing things. And when a one-liner is not quite enough, Perl lets you do amazingly powerful stuff in just a few lines of code.

Your problem for example is solved like this using perl:

perl -e 'while (<>) { s/\s+$//; print $_."\n"; }' <win.txt > unix.txt

This assumes you run this script from unix. It will take all the white space out of the end of each line and then re-print it with the unix "\n" line ending.

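I believe a way that would work on both Windows and Unix is this (the binmode is needed because Perl on Windows otherwise writes chr(10) as CR+LF, exactly as it does with "\n"):

perl -e 'binmode(STDOUT); while (<>) { s/\s+$//; print $_ . chr(10); }' <win.txt > unix.txt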
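
To check that a conversion like this did its job, one can test the result for leftover trailing whitespace. A small sketch, assuming a grep with POSIX character classes; is_clean is a hypothetical name:

```shell
# is_clean (hypothetical name): succeeds (exit status 0) if no line on
# stdin ends in whitespace. CR counts as whitespace, so a leftover CR+LF
# line ending makes the check fail as well.
is_clean() {
    ! grep -q '[[:space:]]$'
}
```

For example, `is_clean < unix.txt && echo "looks fine"` prints the message only when both requirements (LF endings, no trailing whitespace) are met.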

Don