Previous: awk split records, Up: Records [Contents][Index]
When using gawk, the value of RS is not limited to a
one-character string.  If it contains more than one character, it is
treated as a regular expression
(see section Regular Expressions). (c.e.)
In general, each record
ends at the next string that matches the regular expression; the next
record starts at the end of the matching string.  This general rule is
actually at work in the usual case, where RS contains just a
newline: a record ends at the beginning of the next matching string (the
next newline in the input), and the following record starts just after
the end of this string (at the first character of the following line).
The newline, because it matches RS, is not part of either record.
When RS is a single character, RT
contains the same single character. However, when RS is a
regular expression, RT contains
the actual input text that matched the regular expression.
If the input file ends without any text matching RS,
gawk sets RT to the null string.
The following example illustrates both of these features.
It sets RS equal to a regular expression that
matches either a newline or a series of one or more uppercase letters
with optional leading and/or trailing whitespace:
$ echo record 1 AAAA record 2 BBBB record 3 |
> gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
>             { print "Record =", $0,"and RT = [" RT "]" }'
-| Record = record 1 and RT = [ AAAA ]
-| Record = record 2 and RT = [ BBBB ]
-| Record = record 3 and RT = [
-| ]
The square brackets delineate the contents of RT, letting you
see the leading and trailing whitespace. The final value of
RT is a newline.
See section A Simple Stream Editor for a more useful example
of RS as a regexp and RT.
If you set RS to a regular expression that allows optional
trailing text, such as ‘RS = "abc(XYZ)?"’, it is possible, due
to implementation constraints, that gawk may match the leading
part of the regular expression, but not the trailing part, particularly
if the input text that could match the trailing part is fairly long.
gawk attempts to avoid this problem, but currently, there’s
no guarantee that this will never happen.
NOTE: Remember that in awk, the ‘^’ and ‘$’ anchor metacharacters match the beginning and end of a string, and not the beginning and end of a line. As a result, something like ‘RS = "^[[:upper:]]"’ can only match at the beginning of a file. This is because gawk views the input file as one long string that happens to contain newline characters. It is thus best to avoid anchor metacharacters in the value of RS.
The use of RS as a regular expression and the RT
variable are gawk extensions; they are not available in
compatibility mode
(see section Command-Line Options).
In compatibility mode, only the first character of the value of
RS determines the end of the record.
RS = "\0" Is Not Portable

There are times when you might want to treat an entire data file as a
single record.  The only way to make this happen is to give RS a value
that does not occur in the input file.

You might think that for text files, the NUL character, which
consists of a character with all bits equal to zero, is a good
value to use for RS:

BEGIN { RS = "\0" }  # whole file becomes one record?

Almost all other [...]

It happens that recent versions of [...]

See section Reading a Whole File at Once for an interesting way to read
whole files.  If you are using [...]