Why awk
When text is laid out in columns (logs, df, ps, CSV), awk splits it
into fields and gives you the usual numeric operations. [[cmd-sed|sed]] is
strong line by line; awk is strong at fields plus counters plus
aggregation. A rough rule for choosing between awk and Python: if the job fits in
about 30 lines of awk, or the table has about 5 columns, awk wins.
On Linux you usually have gawk (GNU awk) or mawk (faster, more
minimal). The POSIX subset works everywhere. GNU extensions (`gensub`,
true multidimensional arrays via `a[i][j]`, `asort`) are gawk only.
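
For scripts that must stay inside the POSIX subset, a two-part key can be emulated with a comma subscript, which awk joins using `SUBSEP`. The following is a minimal sketch with made-up input lines (host and status), not a replacement for gawk's arrays of arrays:

```shell
# Count (host, status) pairs with a POSIX-style composite key.
# The four input lines are invented sample data.
out=$(printf 'web1 200\nweb1 500\nweb2 200\nweb1 500\n' |
  awk '{ n[$1, $2]++ }
       END {
         for (k in n) {
           split(k, part, SUBSEP)   # recover the two key components
           print part[1], part[2], n[k]
         }
       }' | sort)
printf '%s\n' "$out"
```

The trailing `sort` is there because `for (k in n)` visits keys in an unspecified order.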
Basic syntax
awk 'pattern { action }' file
awk -F: '{ print $1 }' /etc/passwd
awk -f script.awk file
- pattern is a regex `/foo/`, a numeric condition `$3 > 100`, `BEGIN`, `END`, or a combination with `&&` / `||`
- action is a `{ ... }` block. A pattern with no action defaults to `{ print }`. An action with no pattern runs on every line.
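
The three forms described above can be tried on a tiny inline input. This is only an illustrative sketch; the three data lines and the variable names are made up:

```shell
# Invented three-line input: a name and a number.
data='alpha 50
beta 150
gamma 300'

# Pattern only: the default action prints each matching line.
only_pattern=$(printf '%s\n' "$data" | awk '$2 > 100')

# Action only: runs on every line.
only_action=$(printf '%s\n' "$data" | awk '{ print $1 }')

# Regex match on a field combined with a numeric condition.
combined=$(printf '%s\n' "$data" | awk '$1 ~ /a$/ && $2 > 100 { print $1 }')

printf '%s\n---\n%s\n---\n%s\n' "$only_pattern" "$only_action" "$combined"
```

In the last command `alpha` ends in `a` but fails the numeric test, so only lines satisfying both conditions are printed.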
Fields and built-in variables
| Variable | Meaning |
|---|---|
| `$0` | the whole line |
| `$1`, `$2`, ... `$NF` | fields 1..N |
| `NF` | number of fields in the current line |
| `NR` | number of the current line (overall counter) |
| `FNR` | line number within the current file (for multi-file) |
| `FS` | field separator on read (default: spaces/tabs) |
| `OFS` | separator when printing with `print a,b,c` |
| `RS` | record separator (default `\n`) |
| `ORS` | record separator on output |
| `FILENAME` | name of the current file |
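
To see several of these variables at once, here is a small sketch on two made-up colon-separated records. It sets `FS` and `OFS` in a `BEGIN` block and prints the line number, the field count, and the first and last fields:

```shell
# Two invented records in a passwd-like layout.
out=$(printf 'root:x:0\nalice:x:1000\n' |
  awk 'BEGIN { FS = ":"; OFS = "," }
       { print NR, NF, $1, $NF }')
printf '%s\n' "$out"
```

Because the arguments to `print` are separated by commas, `OFS` is placed between them; writing `print NR NF` instead would concatenate the values with nothing in between.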
# Every account and its login shell
awk -F: '{ print $1, $7 }' /etc/passwd
# Top 10 IPs by request count in access.log
awk '{ print $1 }' access.log | sort | uniq -c | sort -rn | head
# Sum of file sizes
ls -l | awk '{ sum += $5 } END { print sum/1024/1024 " MiB" }'
BEGIN and END
- `BEGIN { ... }` runs before the first line is read. Common uses: setting `FS`, initializing variables, printing a report header.
- `END { ... }` runs after the last line. This is where the final summary goes.
awk 'BEGIN { FS=":"; print "user\tshell" } { print $1 "\t" $7 } END { print "total:", NR }' /etc/passwd
Conditions and arithmetic
# 5xx requests from nginx
awk '$9 >= 500 && $9 < 600' access.log
# Failed SSH logins for today (%e pads the day with a space, as traditional syslog timestamps do)
awk -v d="$(date '+%b %e')" '$0 ~ d && /Failed password/' /var/log/auth.log
# Convert bytes to MiB in `ls -l`
awk '{ printf "%-30s %.2f MiB\n", $9, $5/1024/1024 }' <(ls -l)
Awk is not strictly typed: a value is treated as a number or a string depending on context. `$1 + 0`
forces a number, `$1 ""` forces a string.
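
The difference shows up in comparisons. In this small sketch (the values 10 and 9 are arbitrary), the same two fields compare differently depending on which conversion is forced:

```shell
# Numeric context: 10 > 9 is true, so awk prints 1.
num=$(echo '10 9' | awk '{ print (($1 + 0) > ($2 + 0)) }')

# String context: "10" sorts before "9", so the comparison is false and awk prints 0.
str=$(echo '10 9' | awk '{ print (($1 "") > ($2 "")) }')

echo "numeric: $num, string: $str"
```

The outer parentheses matter: without them, `print a > b` would be parsed as output redirection to a file named by `b`.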
Associative arrays
This is the main thing awk does more easily than the shell:
# Top 5 IP addresses by 5xx count
awk '$9 >= 500 { count[$1]++ } END { for (ip in count) print count[ip], ip }' access.log \
  | sort -rn | head -5
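# Sum of bytes per user-agent (combined log format: split on double quotes, $3 is " status bytes " and $6 is the user agent)
awk -F'"' '{ split($3, f, " "); bytes[$6] += f[2] } END { for (k in bytes) print bytes[k], k }' access.log
A simple report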
# report.awk
BEGIN { FS=","; OFS="\t"; print "host", "errors", "avg_ms" }
{ errs[$1] += $2; ms[$1] += $3; cnt[$1]++ }
END { for (h in errs) print h, errs[h], ms[h]/cnt[h] }
To run it:
awk -f report.awk metrics.csv | sort -k2 -rn
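
To check the aggregation logic without a real metrics file, here is a self-contained sketch. It assumes the CSV layout is `host,errors,latency_ms` (inferred from the report header, not stated in the source), uses invented sample rows, and inlines the per-line and `END` rules without the header line so the sorted output stays clean:

```shell
# Invented sample data in the assumed layout: host,errors,latency_ms
tmp=$(mktemp -d)
cat > "$tmp/metrics.csv" <<'EOF'
web1,2,100
web2,0,40
web1,4,300
web2,1,60
EOF

# Same accumulation as report.awk: total errors and mean latency per host.
report=$(awk -F, -v OFS='\t' '
  { errs[$1] += $2; ms[$1] += $3; cnt[$1]++ }
  END { for (h in errs) print h, errs[h], ms[h] / cnt[h] }
' "$tmp/metrics.csv" | sort -k2,2nr)

printf '%s\n' "$report"
rm -rf "$tmp"
```

With these rows, web1 accumulates 6 errors with a mean of 200 ms and web2 accumulates 1 error with a mean of 50 ms, so web1 sorts first.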
awk vs sed vs jq
| Task | Tool |
|---|---|
| Replace a pattern in a line | [[cmd-sed\|sed]] |
| One or two columns plus a filter | awk |
| Aggregation by key | awk |
| JSON | [[cmd-jq\|jq]] |
| Multiline structures | Python |
| XML | XSLT / Python |
When something goes wrong
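- `$10` does not do what you expect: in awk, `$10` is the tenth field, not `$1` followed by `0`. With a space, `print $1 0` is concatenation. Being explicit fixes it: `print $1 "0"` concatenates, `print ($1 + 10)` adds.
- Fields shift because of spaces inside values: the default `FS` splits on any run of spaces and tabs. Use `-F'\t'` for strict TSV, or `awk -F'"' ...` for CSV with quotes (but honest CSV parsing belongs to Python).
- A pattern with `\d` matches nothing: POSIX awk uses extended regular expressions, not PCRE, so `\d` is not a digit class. Write `[0-9]`.
- gawk vs mawk: `gensub`, `asort`, and the third argument of `match()` are GNU only. If a script breaks on a system where `awk` is not gawk (Debian and Ubuntu often default to mawk; Alpine ships BusyBox awk), check this.
- stdin input got muddled: `awk '{...}' < file` and `cat file | awk '{...}'` are both fine. `awk < file '{...}'` also works, because the shell accepts a redirection anywhere in the command, but it is harder to read. Note that awk ignores stdin as soon as a file name is given as an argument.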
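
Several of the pitfalls above can be reproduced in a few lines. The following sketch uses made-up input and hypothetical variable names, and sticks to POSIX awk so it should behave the same under gawk, mawk, and BusyBox awk:

```shell
# Eleven invented fields, so that a tenth field exists.
line='a b c d e f g h i j k'

tenth=$(echo "$line" | awk '{ print $10 }')      # the tenth field
concat=$(echo "$line" | awk '{ print $1 "0" }')  # the first field followed by "0"

# Portable digit class instead of \d.
digits=$(printf 'abc\nx42\n' | awk '/[0-9]/')

# A tab separator keeps a value containing a space in one field.
city=$(printf 'New York\t8\n' | awk -F'\t' '{ print $1 }')

printf '%s\n' "$tenth" "$concat" "$digits" "$city"
```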