awk: field-oriented processing of structured text

Why awk

When text is laid out in columns (logs, df, ps, CSV), awk splits each line into fields and lets you filter, count, and do arithmetic on them. [[cmd-sed|sed]] is strongest at line-by-line substitution; awk is strongest at fields, counters, and aggregation. A rough boundary with Python: if the job fits in about 30 lines of awk, or the table has about 5 columns, awk usually wins.

On Linux, awk is usually gawk (GNU awk) or mawk (faster, more minimal). The POSIX subset works everywhere. GNU extensions such as gensub, asort, and true multidimensional arrays (a[i][j]) are gawk only; portable awk simulates multiple dimensions with comma subscripts, a[i,j].

Basic syntax

awk 'pattern { action }' file

awk -F: '{ print $1 }' /etc/passwd

awk -f script.awk file

pattern is a regex (/foo/), a comparison ($3 > 100), one of the special patterns BEGIN and END, or regexes and comparisons combined with &&, ||, and !.
action is a { ... } block. A pattern with no action defaults to { print }. An action with no pattern runs on every line.
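A minimal sketch of the three forms, run against inline sample data. The `demo_rows` helper and its rows (name, department, salary) are invented for illustration.

```shell
# Hypothetical sample data: name, department, salary.
demo_rows() {
  printf '%s\n' 'ann eng 120' 'bob ops 90' 'cy eng 150'
}

# Pattern only: the default action { print } prints each matching line.
pattern_only() { demo_rows | awk '$3 > 100'; }

# Action only: the block runs on every line.
action_only() { demo_rows | awk '{ print $1 }'; }

# Pattern plus action, with a regex and a comparison joined by &&.
both() { demo_rows | awk '/eng/ && $3 > 130 { print $1, $3 }'; }

pattern_only
action_only
both
```

Here `pattern_only` prints the two rows with a salary above 100, `action_only` prints the three names, and `both` prints only `cy 150`.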

Fields and built-in variables

Variable	Meaning
`$0`	the whole line
`$1`, `$2`, ... `$NF`	fields 1 through NF (`$NF` is the last field)
`NF`	number of fields in the current line
`NR`	number of the current line (overall counter)
`FNR`	line number within the current file (for multi-file)
`FS`	field separator on read (default: any run of spaces or tabs)
`OFS`	separator when printing with `print a,b,c`
`RS`	record separator (default `\n`)
`ORS`	record separator on output
`FILENAME`	name of the current file

bash

# Every account and its login shell

awk -F: '{ print $1, $7 }' /etc/passwd

# Top 10 IPs by request count in access.log

awk '{ print $1 }' access.log | sort | uniq -c | sort -rn | head

# Sum of file sizes

ls -l | awk '{ sum += $5 } END { print sum/1024/1024 " MiB" }'
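The counters and separators from the table are easier to see side by side. The sketch below assumes two throwaway files created in a temporary directory; the file names and contents are illustrative only.

```shell
# Prints NR, FNR, NF and the first field for two small files.
counters() {
  tmp=$(mktemp -d) || return 1
  printf 'a:1\nb:2\n' > "$tmp/one.txt"
  printf 'c:3\n'      > "$tmp/two.txt"

  # NR keeps counting across files; FNR restarts at 1 for each file.
  # OFS is inserted wherever print sees a comma.
  awk -F: 'BEGIN { OFS = "|" } { print NR, FNR, NF, $1 }' \
    "$tmp/one.txt" "$tmp/two.txt"

  rm -rf "$tmp"
}

counters
```

The third output line is `3|1|2|c`: the third record overall, but the first record of the second file.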

BEGIN and END

BEGIN { ... } runs before the first line is read. Common uses: setting FS, initializing variables, a report header.
END { ... } runs after the last line. The final summary.

bash

awk 'BEGIN { FS=":"; print "user\tshell" }

     { print $1 "\t" $7 }

     END { print "total:", NR }' /etc/passwd

Conditions and arithmetic

bash

# 5xx requests from nginx

awk '$9 >= 500 && $9 < 600' access.log

# Failed SSH logins for today (assumes classic syslog timestamps such as "Jan  5"; %e space-pads the day)

awk -v d="$(date '+%b %e')" 'index($0, d) == 1 && /Failed password/' /var/log/auth.log

# Convert bytes to MiB in `ls -l`

awk 'NR > 1 { printf "%-30s %.2f MiB\n", $9, $5/1024/1024 }' <(ls -l)

Awk is not strictly typed: a value is treated as a number or a string depending on context. $1 + 0 forces a number, $1 "" forces a string.
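A short sketch of why forcing the type matters. The input `10 9` is made up; the point is that the same two fields compare differently as numbers and as strings.

```shell
coerce() {
  echo '10 9' | awk '{
    num = ($1 + 0 > $2 + 0)      # numeric comparison: 10 > 9 is true
    str = (($1 "") > ($2 ""))    # string comparison: "10" sorts before "9"
    # A string used as a number keeps only its leading numeric part.
    print num, str, "3x" + 0, "abc" + 0
  }'
}

coerce
```

This prints `1 0 3 0`: the numeric test succeeds, the string test fails, "3x" converts to 3, and "abc" converts to 0.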

Associative arrays

This is the main thing awk does more easily than the shell:

bash

# Top 5 IP addresses by 5xx count

awk '$9 >= 500 { count[$1]++ }
     END { for (ip in count) print count[ip], ip }' access.log \
  | sort -rn | head -5

# Sum of bytes per user-agent (assumes the combined log format: status and bytes sit between the 2nd and 3rd quote)

awk -F'"' '{ split($3, f, " "); bytes[$6] += f[2] } END { for (k in bytes) print bytes[k], k }' access.log
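Array keys can also combine several fields. The sketch below uses the portable comma-subscript form, which joins the parts with SUBSEP, and works in any POSIX awk. The host and status rows are invented sample data.

```shell
# Hypothetical input: host, status code.
by_host_status() {
  printf '%s\n' 'web1 200' 'web1 500' 'web2 500' 'web1 500' |
  awk '{ hits[$1, $2]++ }
       END {
         for (key in hits) {
           split(key, part, SUBSEP)   # recover the two key components
           print part[1], part[2], hits[key]
         }
       }' |
  sort   # for (key in ...) visits keys in an unspecified order
}

by_host_status
```

The piped `sort` is there only to make the output order stable; awk itself makes no promise about the iteration order of `for (key in hits)`.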

A simple report

awk

# report.awk

BEGIN { FS=","; OFS="\t"; print "host", "errors", "avg_ms" }

{ errs[$1] += $2; ms[$1] += $3; cnt[$1]++ }

END   { for (h in errs) print h, errs[h], ms[h]/cnt[h] }

To run it:

bash

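# Keep the header line on top, then sort the data rows by error count, descending

awk -f report.awk metrics.csv | { IFS= read -r header; printf '%s\n' "$header"; sort -k2,2nr; }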
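To see what the report produces, the same aggregation can be fed a few invented rows. This sketch assumes a header-less CSV of host, errors, ms, and it omits the header line so that the rows can be sorted directly.

```shell
report_demo() {
  # Hypothetical metrics: host,errors,ms
  printf '%s\n' 'web1,2,100' 'web2,0,40' 'web1,4,300' |
  awk 'BEGIN { FS = ","; OFS = "\t" }
       { errs[$1] += $2; ms[$1] += $3; cnt[$1]++ }
       END { for (h in errs) print h, errs[h], ms[h] / cnt[h] }' |
  sort -k2,2nr   # highest error count first
}

report_demo
```

web1 comes out with 6 errors and an average of 200 ms, web2 with 0 errors and 40 ms.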

awk vs sed vs jq

Task	Tool
Replace a pattern in a line	[[cmd-sed|sed]]
One or two columns plus a filter	awk
Aggregation by key	awk
JSON	[[cmd-jq|jq]]
Multiline structures	Python
XML	XSLT / Python

When something goes wrong

$10 does not do what you expected: in awk, $10 is the tenth field, not $1 followed by "0". To append a literal 0 to the first field, write print $1 "0" or print ($1) "0"; to add ten, write print $1 + 10.
Fields shift when values contain spaces: the default FS splits on any run of spaces or tabs. Use -F'\t' for strict TSV, or awk -F'"' ... for CSV with quoted values (real CSV parsing is better left to Python).
A pattern with \d matches nothing: POSIX awk uses extended regular expressions, not PCRE, so \d is not a digit class. Write [0-9] or [[:digit:]].
gawk vs mawk: gensub, asort, and the third argument of match() are gawk extensions. If a script breaks on Debian or Ubuntu (mawk by default) or on Alpine (which ships BusyBox awk), check for these first.
Program and input in the wrong order: awk '{...}' < file, cat file | awk '{...}', and awk < file '{...}' all work, because the shell removes the redirection before awk sees its arguments. What fails is awk file '{...}': the first non-option argument is always taken as the program.
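A few of these pitfalls can be reproduced in one place. The inputs below are invented, and the function names are only labels for this sketch.

```shell
# $10 is field ten; ($1) "0" is field one with a literal 0 appended.
field_ten() {
  echo 'a b c d e f g h i j' | awk '{ print $10, ($1) "0" }'
}

# Default FS splits "New York" in two; -F with a tab keeps it whole.
default_second() { printf 'New York\t8\n' | awk '{ print $2 }'; }
tsv_second()     { printf 'New York\t8\n' | awk -F'\t' '{ print $2 }'; }

# Portable digit matching: a bracket expression instead of \d.
digits() {
  printf '%s\n' 'id 42' 'id x7' | awk '$2 ~ /^[0-9]+$/ { print $2 }'
}

field_ten
default_second
tsv_second
digits
```

`field_ten` prints `j a0`, `default_second` prints `York` while `tsv_second` prints `8`, and `digits` keeps only `42`.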

