Sunday, June 11, 2017

Posted by beni | June 11, 2017

A little exercise about a recent forum question: input field handling in awk and Perl


Just recently the following question was posted to the UNIX scripting group on LinkedIn:
remove all duplicate entries in a colon separated list of strings
e.g. a: b:c: b: d:a:a: b:e::f should be transformed to a: b:c: d:e::f

Some of the fields contain spaces, which should of course be preserved in the output. There is also an empty field, which (to me and other authors) indicates that the fields are not necessarily ordered. I won't discuss the suggested solutions here; I also did not reply to the original posting because I read it one month too late.

awk
But when reading the question my brain had already started working, and I could not help trying it myself. The obvious tool for exercises like this is awk, because it has built-in mechanisms for treating lines as a sequence of fields with a configurable field separator.

A solution could be

BEGIN {
  FS=":" ;    # field separator
  ORS=":"     # output record separator
}
{ for(i=1;i<=NF;i++) {  # for all input fields
    if( f[$i] ) {       # check if array entry for field already exists
     continue;          # if yes: go to next field
    } else {
     print $i;          # if no: print the field content
     f[$i] = 1;          # and record it in array f
    }  }
}

which leads to this output:
a: b:c: d:e::f:

The script can be shortened by omitting superfluous braces and else to

BEGIN { FS=":" ; ORS=":" } 
{ for(i=1;i<=NF;i++) { if(f[$i]) continue; f[$i]=1; print $i; } } 

The script uses very simple, straightforward logic: loop through all input fields; if a field is new, print it, otherwise skip it. This is achieved by storing each field in an associative array f when it first occurs.
Using the field separator FS for splitting the input line and the output record separator ORS when printing (you need to know that print automatically adds ORS) makes this an easy task.
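To try the script from a shell, it can be wrapped in a small function. This is only a sketch: the function name dedup_colon is made up for illustration, and printf '%s' is used in place of echo -n because it sends the string without a trailing newline in any POSIX shell.

```shell
# dedup_colon is a hypothetical helper name; it reads one line on stdin.
dedup_colon() {
  awk 'BEGIN { FS=":"; ORS=":" }
       { for (i = 1; i <= NF; i++) {
           if (f[$i]) continue   # field already seen: skip it
           f[$i] = 1             # remember the field
           print $i              # print appends ORS, i.e. a colon
         } }'
}

# Sample input from the forum question, sent without a trailing newline
printf '%s' 'a: b:c: b: d:a:a: b:e::f' | dedup_colon
```

The output ends with a colon and has no final newline, because ORS replaces the newline that print would normally append.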

There is one issue though: this solution adds an extra colon at the very end (compared to the requested output). Whether that matters depends on the context of the request, so one might prefer this code:

BEGIN { FS=":" }
{ printf "%s", $1; f[$1]=1;   # "%s" keeps a % in a field from being read as a format
  for(i=2;i<=NF;i++) { if(f[$i]) continue; f[$i]=1; printf "%s%s", FS, $i } }

which uses a slightly different logic: the first field is printed straight away (and recorded), the loop checks the remaining fields 2..NF and prints the field separator as a prefix to the field content. This code also works for the extreme case where there is just one field and no colon.
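The edge cases can be checked with a short shell sketch. The function name dedup_no_trailing is hypothetical, and the awk code is written with an explicit "%s" format so that a percent sign inside a field is printed literally rather than interpreted by printf.

```shell
# dedup_no_trailing is a hypothetical name for the variant without a trailing colon.
dedup_no_trailing() {
  awk 'BEGIN { FS=":" }
       { printf "%s", $1; f[$1] = 1          # first field: print and record
         for (i = 2; i <= NF; i++) {
           if (f[$i]) continue               # duplicate: skip
           f[$i] = 1
           printf "%s%s", FS, $i             # separator as prefix, then the field
         } }'
}

# The trailing "echo" only adds a newline so the results appear on separate lines
printf '%s' 'a: b:c: b: d:a:a: b:e::f' | dedup_no_trailing; echo   # a: b:c: d:e::f
printf '%s' 'single'                   | dedup_no_trailing; echo   # single
printf '%s' '50%:off:50%'              | dedup_no_trailing; echo   # 50%:off
```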

Perl
I then wondered whether this couldn't be done equivalently, or even more briefly, in Perl, but my best solution is a little longer because I have to use split to get the individual fields.

$FS=":";
@s = split($FS,<>);
for($i=0;$i<=$#s;$i++) {$e=$s[$i]; next if(exists($f{$e})); $f{$e}=1; print $e,$FS }


I could have used the command line options "-a -F:" to avoid the split, but I need FS to be defined anyway for the output (I don't know whether the split pattern defined by -F can be accessed in Perl).
I use split to chop up the input line and put it into an array s. Then the same logic applies as in awk. Instead of awk's associative array I'm using a hash f in Perl. The variable e is only used to avoid repeated occurrences of $s[$i]. In the end it's a matter of personal preference which solution you choose.


It should be noted that I tested with

echo -n "...." | awk ... or perl -e ...

which feeds a string without a trailing newline into the pipe; this avoids having to chomp the newline off the last field in Perl.
