Searching your site’s access logs can be difficult! There’s a lot of information to take in visually, and viewing entire log entries can be overwhelming. In this article we’ll discuss ways to parse your access logs to better understand trends on your site!
Requirements
Before we get started, let’s talk prerequisites. To use the bash commands we’ll discuss, you’ll need:
- Apache and/or Nginx
- SSH access to your server
Using grep, cut, sort, uniq, and awk
Before we dive into the logs themselves, let’s talk about how we’ll be parsing them. First, you’ll want to understand the bash commands we use to collect specific pieces of information from the logs.
cat
Used to “concatenate” two or more files together. When searching access logs, we’d typically use it to print the contents of two files together so we can search for information in both, for example, today’s and yesterday’s logs.
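For example, to print today’s and yesterday’s logs together (a minimal sketch, assuming they’re named access.log and access.log.1):
cat access.log access.log.1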
zcat
Works the same as cat, but concatenates and prints out results of gzip (.gz) compressed files specifically.
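For example, to print an older, rotated log that has been gzipped (assuming a file named access.log.2.gz):
zcat access.log.2.gz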
grep
Used to search for a specific string or pattern in a file.
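For example, to find every line mentioning a given string (wp-login.php here is just a hypothetical search term):
grep "wp-login.php" access.log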
ack-grep
Used to find strings in one file or several files. The ack-grep command is more efficient than grep when looking for results in a directory or across multiple files. And, unlike a standard grep, ack-grep prints the line number on which the text string was found. It also makes it easy to search only files of a specific kind, PHP files for example.
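For example, to search only PHP files under the current directory (a sketch, assuming your distribution installs the command as ack-grep rather than ack, and using get_header as a hypothetical search string):
ack-grep --php "get_header"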
cut
Used to cut out specific pieces of information, useful for sorting through grep results. With the -d flag you set a “delimiter” (the character that divides fields), and with the -f flag you choose which field(s), as split by that delimiter, to print.
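For example, to print just the first space-separated field of each line (the client IP in the Apache log format shown later in this article):
cut -d' ' -f1 access.log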
sort
Sorts output lines in ascending order, useful for arranging grep and cut results. You can use sort -rn to sort numerically from highest to lowest instead.
uniq
Filters out repeated lines, most often used as uniq -c to get a count of occurrences per unique entry. Note that uniq only collapses adjacent duplicate lines, which is why it’s almost always run on sorted output.
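Putting sort and uniq together, for example, to count how many times each IP appears in a log and list the busiest first (assuming the IP is the first space-separated field):
cut -d' ' -f1 access.log | sort | uniq -c | sort -rn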
awk
A text-processing language for Unix. Here we mostly use it to split lines into fields with awk -F (-F sets the “field separator”) and print or filter on those fields.
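For example, to print the 3rd pipe-separated field of each line (a sketch; adjust the delimiter and field number to your own log format):
awk -F'|' '{ print $3 }' access.log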
head
Prints the first 10 lines of the specified file(s) by default. You can adjust the count with the -n flag, for example head -n 20.
tail
Prints the last 10 lines of the specified file(s) by default. As with head, you can adjust the count with the -n flag. You can also live-monitor access logs using tail -f logname.log.
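For example, to print the first 20 lines of a log, the last 20, or to watch new entries as they arrive:
head -n 20 access.log
tail -n 20 access.log
tail -f access.log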
find
Used to find files by a particular name, date, or extension type. You can use -type d to match directories, or -type f to match files.
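For example, to list all gzipped, rotated logs under /var/log:
find /var/log -type f -name "*.gz"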
Apache access logs
So how do we put all the information above together? Let’s start by looking at the Apache access logs. Apache is the most commonly used web server, so it’s a good place to start.
Before you get started, be sure to locate your access logs. Depending on the version of Linux you’re running, they could be in a slightly different location. My server runs Ubuntu, so my logs are found in the /var/log/apache2/ directory. Here’s a sample of my Apache access logs for a reference point:
104.54.207.40 techgirlkb.guru - [18/Aug/2017:21:52:08 +0000] "GET /wp-content/uploads/2017/08/lynx_server_status_2.png HTTP/1.1" 200 223989 "http://techgirlkb.guru/2017/08/troubleshooting-high-server-load/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
104.54.207.40 techgirlkb.guru - [18/Aug/2017:21:52:08 +0000] "GET /wp-content/uploads/2017/08/sar-q-output-768x643.png HTTP/1.1" 200 357286 "http://techgirlkb.guru/2017/08/troubleshooting-high-server-load/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
104.54.207.40 techgirlkb.guru - [18/Aug/2017:21:52:08 +0000] "GET /wp-content/cache/autoptimize/js/autoptimize_a60c72b796d777751fdd13d6f0375f9c.js HTTP/1.1" 200 49995 "http://techgirlkb.guru/2017/08/troubleshooting-high-server-load/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
104.54.207.40 techgirlkb.guru - [18/Aug/2017:21:52:09 +0000] "GET /wp-content/uploads/2017/08/htop_mysql_usage.png HTTP/1.1" 200 456536 "http://techgirlkb.guru/2017/08/troubleshooting-high-server-load/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
104.54.207.40 techgirlkb.guru - [18/Aug/2017:21:52:09 +0000] "GET /wp-content/uploads/2017/08/mytop.jpg HTTP/1.1" 200 417613 "http://techgirlkb.guru/2017/08/troubleshooting-high-server-load/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
Totals of HTTP response codes
cut -d' ' -f9 techgirlkb.access.log | sort | uniq -c | sort -rn
This command says: using a space (' ') as the delimiter, print the 9th field from techgirlkb.access.log, which is the HTTP response code. Then sort arranges the results so duplicates sit next to each other, uniq -c collapses each response code into a single entry with a count, and sort -rn orders the counts from highest to lowest. You should get an output like this:
1663 200
78 499
58 302
33 404
29 301
18 444
12 206
3 304
2 403
1 416
Hits to Apache per hour
awk -F'[' '{print $2 }' techgirlkb.access.log | awk -F: '{ print "Time: " $1,$2 ":00" }' | sort | uniq -c | more
This command says: using '[' as the delimiter, print the 2nd field from the Apache access log. Then, with that output, using ':' as the delimiter, print the word "Time:" followed by the 1st and 2nd fields, followed by ":00". This groups the requests from the same hour together so you can get a count. You should get an output that looks like this:
293 Time: 18/Aug/2017 00:00
78 Time: 18/Aug/2017 01:00
188 Time: 18/Aug/2017 02:00
79 Time: 18/Aug/2017 03:00
27 Time: 18/Aug/2017 04:00
14 Time: 18/Aug/2017 05:00
40 Time: 18/Aug/2017 06:00
4 Time: 18/Aug/2017 07:00
74 Time: 18/Aug/2017 08:00
Top requests to Apache for today and yesterday
cat techgirlkb.access.log techgirlkb.access.log.1 | cut -d' ' -f7 | sort | uniq -c | sort -rn | head -20
This command says to concatenate today’s and yesterday’s access logs. Then, using space as the delimiter, print the 7th field (the request path). With that output, sort from lowest to highest, group like entries together with a count, sort by the count, and print the top 20 results. Your output should look something like this:
202 /wp-admin/admin-ajax.php
59 /feed/
49 /2017/08/troubleshooting-high-server-load/
37 /2017/08/the-anatomy-of-a-ddos/
33 /
Top IPs to hit the site today and yesterday
cat techgirlkb.access.log techgirlkb.access.log.1 | cut -d' ' -f1 | sort | uniq -c | sort -rn | head -25
This command says to concatenate the logs for today and yesterday. Then, with the output, using space as the delimiter, print the 1st field (the client IP). Sort and group your results together to get the list of the top 25 IPs. Here’s an example output:
343 104.54.207.40
341 104.196.38.166
114 66.162.212.19
75 38.140.212.19
56 173.212.242.97
46 5.9.106.230
45 173.212.203.245
Top User-Agents to hit the site today and yesterday
cat techgirlkb.access.log techgirlkb.access.log.1 | cut -d'"' -f6 | sort | uniq -c | sort -rn | head -20
This command says to concatenate today’s and yesterday’s logs. Then, with the output, using double quotes (") as the delimiter, print the 6th field. Sort and combine the results, printing the top 20 unique User-Agent strings.
628 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36
339 WordPress/4.8.1; http://techgirlkb.guru
230 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
230 Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)
209 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Top HTTP referrers to the site today
cut -d'"' -f4 techgirlkb.access.log | sort | uniq -c | sort -rn | head -20
This command says to print the 4th field, using double quotes (") as the delimiter, then sort, combine into unique entries with a count, and print the top 20 results. Here’s what your output should look like:
714 -
278 http://techgirlkb.guru/2017/08/the-anatomy-of-a-ddos/
151 http://techgirlkb.guru/wp-admin/post-new.php
100 http://techgirlkb.guru/wp-admin/options-general.php?page=NextScripts_SNAP.php
87 http://techgirlkb.guru/wp-admin/
Nginx access logs
If you use Nginx, you’ll need to adjust your commands to match its log format. Here’s what my Nginx logs look like, for reference:
18/Aug/2017:22:31:41 +0000|v1|104.196.38.166|techgirlkb.guru|499|0|127.0.0.1:6789|-|0.996|POST /wp-cron.php?doing_wp_cron=1503095500.1473810672760009765625 HTTP/1.1
18/Aug/2017:22:31:42 +0000|v1|104.54.207.40|techgirlkb.guru|302|4|127.0.0.1:6788|2.941|3.109|POST /wp-admin/post.php HTTP/1.1
18/Aug/2017:22:31:43 +0000|v1|104.54.207.40|techgirlkb.guru|200|60419|127.0.0.1:6788|0.870|0.870|GET /wp-admin/post.php?post=182&action=edit&message=10 HTTP/1.1
18/Aug/2017:22:31:44 +0000|v1|104.54.207.40|techgirlkb.guru|200|1321|-|-|0.000|GET /wp-content/plugins/autoptimize/classes/static/toolbar.css?ver=1503095503 HTTP/1.1
18/Aug/2017:22:31:44 +0000|v1|104.54.207.40|techgirlkb.guru|404|564|-|-|0.000|GET /wp-content/plugins/above-the-fold-optimization/admin/css/admincp-global.min.css?ver=2.7.12 HTTP/1.1
Notice that in these access logs, most fields are separated by a pipe (|). We also have some additional information in these logs, like how long the request took and what port it was routed to. We can use this information to look at even more contextual data. Usually your Nginx access logs can be found in /var/log/nginx/.
Totals of HTTP response codes today
cut -d'|' -f5 techgirlkb.access.log | sort | uniq -c | sort -rn
This command says to print the 5th field, using a pipe (|) as the delimiter; in this format, that field holds the response code. Then sort and combine the results into unique entries with a count, and show the results sorted from highest to lowest frequency. Your output should look something like this:
1751 200
113 499
59 302
34 404
29 301
18 444
12 206
3 304
2 403
1 416
Top requests to Nginx today
cut -d' ' -f3 techgirlkb.access.log | sort | uniq -c | sort -rn | head -20
This command says: using space as the delimiter, print the 3rd field (the request path). Then sort and combine the results, and print the top 20. Your output should look like this:
142 /wp-admin/admin-ajax.php
79 /wp-content/uploads/2017/08/staircase-600468_1280.jpg
54 /2017/08/the-anatomy-of-a-ddos/
53 /
41 /?url=https%3A//www.google-analytics.com/analytics.js&type=js&abtf-proxy=fd0703a0b1e757ef151b57e8dec02b32
32 /wp-includes/js/wp-emoji-release.min.js?ver=4.8.1
30 /robots.txt
30 /feed/
Show requests that took over 60 seconds to complete today
awk -F\| '{ if ($9 >= 60) print $0 }' techgirlkb.access.log
This command says: using a pipe (|) as the delimiter, print the entire line only if the 9th field is greater than or equal to 60. In our Nginx logs, the 9th field is the total request time in seconds. Your output should look something like this:
18/Aug/2017:00:30:45 +0000|v1|104.54.207.40|techgirlkb.guru|200|643|127.0.0.1:6788|63.400|67.321|POST /wp-admin/async-upload.php HTTP/1.1
18/Aug/2017:00:30:49 +0000|v1|104.54.207.40|techgirlkb.guru|200|644|127.0.0.1:6788|62.343|63.828|POST /wp-admin/async-upload.php HTTP/1.1
18/Aug/2017:00:30:56 +0000|v1|104.54.207.40|techgirlkb.guru|200|642|127.0.0.1:6788|61.613|64.402|POST /wp-admin/async-upload.php HTTP/1.1
Bash Tips and Tricks
When creating commands for searching your logs, there are a few best practices to keep in mind. We’ll cover some tips that apply to searching access logs below.
cat filename | grep pattern – never cat a single file and pipe it to grep. The cat command is made to concatenate the contents of multiple files; for a single file it just adds an unnecessary extra process. It’s much more efficient to use the format: grep "pattern" filename
expr and seq – don’t use outdated scripting methods like expr and seq. Counting is now built into bash, so seq is no longer needed, and expr is inefficient because it was written for older shells: it starts a separate process that calls other programs to do the requested work. In general, prefer the shell’s modern built-ins when writing scripts, as sketched below.
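A quick sketch of the built-in replacements:
# instead of: for i in $(seq 1 5)
for i in {1..5}; do echo "$i"; done
# instead of: expr 3 + 4
echo $(( 3 + 4 ))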
ls -l | awk '{ print $8 }' – if you’re searching for a filename within a directory, never parse the output of ls! Its output isn’t consistent across platforms, so the same command can print very different results in different environments. And since file names can contain newlines, parsing ls can produce garbled output as well.
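Instead, let the shell expand the file names for you. A minimal sketch, assuming Apache logs under /var/log/apache2/:
# glob expansion handles odd file names safely
for f in /var/log/apache2/*.log; do
    echo "$f"
done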
For more best practices when writing scripts, check out this Bash FAQ.
Happy log crunching! I hope these tips and quick commands help you understand the output of your access logs, and help you build your own scripts to do the same. Have any other quick one-liners you use to search your logs? Let me know in the comments, or contact me.