The previous subsection
discussed the use of single characters or simple strings as the
value of FS.
More generally, the value of FS may be a string containing any
regular expression.  In this case, each match in the record for the regular
expression separates fields.  For example, the assignment:
FS = ", \t"
makes every area of an input line that consists of a comma followed by a space and a TAB into a field separator.
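As a quick check of this assignment, the following sketch feeds awk a line whose fields are separated by exactly that three-character sequence. The sample input is invented for illustration, and printf is used so that each `\t` reliably becomes a TAB:

```shell
# Sample data (made up for this sketch): fields separated by comma, space, TAB.
out=$(printf 'a, \tb, \tc\n' | awk 'BEGIN { FS = ", \t" } { print NF, $2 }')
echo "$out"    # prints: 3 b
```

A comma that is not followed by a space and a TAB does not match, so it stays inside its field.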
For a less trivial example of a regular expression, try using
single spaces to separate fields the way single commas are used.
To do so, set FS to "[ ]" (left bracket, space, right
bracket).  This regular expression matches a single space and nothing else
(see section Regular Expressions).
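A minimal sketch of the effect, using an invented input line that contains two adjacent spaces. Because every single space is its own separator, the pair of spaces encloses an empty field:

```shell
# 'a  b c' has two spaces between a and b, so $2 is empty and NF is 4.
out=$(echo 'a  b c' | awk 'BEGIN { FS = "[ ]" } { print NF, "[" $2 "]" }')
echo "$out"    # prints: 4 []
```

With the default value of FS, the same line would have only three fields.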
There is an important difference between the two cases of ‘FS = " "’
(a single space) and ‘FS = "[ \t\n]+"’
(a regular expression matching one or more spaces, TABs, or newlines).
For both values of FS, fields are separated by runs
(multiple adjacent occurrences) of spaces, TABs,
and/or newlines.  However, when the value of FS is " ",
awk first strips leading and trailing whitespace from
the record and then decides where the fields are.
For example, the following pipeline prints ‘b’:
$ echo ' a b c d ' | awk '{ print $2 }'
-| b
However, this pipeline prints ‘a’ (note the extra spaces around each letter):
$ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
>                                  { print $2 }'
-| a
In this case, the first field is null, or empty.
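The difference also shows up in NF. The sketch below, which reuses the same input line and introduces the shell variable names `rec`, `default_nf`, and `regex_nf` purely for illustration, counts the fields both ways. With the regular expression, the leading and trailing runs of whitespace each produce an empty field:

```shell
rec=' a  b  c  d '
# Default FS (" "): leading and trailing whitespace is stripped first.
default_nf=$(echo "$rec" | awk '{ print NF }')
# Regexp FS: the separators at both ends delimit empty fields.
regex_nf=$(echo "$rec" | awk 'BEGIN { FS = "[ \t\n]+" } { print NF }')
echo "$default_nf $regex_nf"    # prints: 4 6
```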
The stripping of leading and trailing whitespace also comes into
play whenever $0 is recomputed.  For instance, study this pipeline:
$ echo '   a b c d' | awk '{ print; $2 = $2; print }'
-|    a b c d
-| a b c d
The first print statement prints the record as it was read,
with leading whitespace intact.  The assignment to $2 rebuilds
$0 by concatenating $1 through $NF together,
separated by the value of OFS (which is a space by default).
Because the leading whitespace was ignored when finding $1,
it is not part of the new $0.  Finally, the last print
statement prints the new $0.
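To make the role of OFS in the rebuilt record easier to see, here is a small variation on the pipeline above. Setting OFS to "-" is an arbitrary choice made for this sketch:

```shell
# Assigning to $2 forces $0 to be rebuilt with OFS between the fields;
# the leading whitespace is gone because it was never part of $1.
out=$(echo '   a b c d' | awk 'BEGIN { OFS = "-" } { $2 = $2; print }')
echo "$out"    # prints: a-b-c-d
```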
There is an additional subtlety to be aware of when using regular expressions
for field splitting.
It is not well specified in the POSIX standard, or anywhere else, what ‘^’
means when splitting fields.  Does the ‘^’ match only at the beginning of
the entire record? Or does the text after each field separator count as a new
string, so that ‘^’ can match there as well?  It turns out that
different awk versions answer this question differently, and you
should not rely on any specific behavior in your programs.
(d.c.)
As a point of information, BWK awk allows ‘^’
to match only at the beginning of the record. gawk
also works this way. For example:
$ echo 'xxAA  xxBxx  C' |
> gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
>                             printf "-->%s<--\n", $i }'
-| --><--
-| -->AA<--
-| -->xxBxx<--
-| -->C<--
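In the output above, the leading ‘xx’ is consumed as a separator (leaving an empty first field), while the ‘xx’ around ‘B’ is kept because ‘^’ does not match there. One possible way to avoid depending on this behavior, offered here only as a sketch and not as the manual's recommendation, is to remove the anchored prefix with sub() before the fields are used, and to keep ‘^’ out of FS entirely. Note that this variant yields no empty first field, so it is not a drop-in replacement for the example above:

```shell
out=$(echo 'xxAA  xxBxx  C' |
      awk 'BEGIN { FS = " +" }
           { sub(/^x+/, "")    # here the anchor applies to the whole record
             printf "%d:%s,%s,%s\n", NF, $1, $2, $3 }')
echo "$out"    # prints: 3:AA,xxBxx,C
```

Because sub() modifies $0, awk splits the record again afterward using the unanchored FS.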