Log Files: Dealing with Inconsistent Field Delimiters


Log files are big. Processing them can be cumbersome, especially when the field separator is not unique.

Take a look at the contents of the file example.log below:

"",6667,"Rembau, NSembilan,Malaysia","GET /phpmyadmin ",404
"",80,"Selangor","GET /phpmyadmin/ ,200
"",9090,"A. Star, Kedah, Malysia","GET /phpmyadmin/favicon.ico,404
"",6667,"Malysia","GET /phpmyadmin/print.css,404
"",993,"A. Star, Kedah, Malysia","GET /phpmyadmin/phpmyadmin.css.php?lang=en-utf-8,404

At first sight, anybody would agree to use a comma (,) as the field separator. But hey, the third field contains that same character.

If we insist on choosing the comma (,) as our separator, the number of fields will not be consistent throughout the file.
Line 1 would have 7 fields, line 2 would have 5 fields, and so on.

If the task is to print the IP address and the file requested, how should we do that?

Luckily, gawk has a special built-in variable, NF, which holds the number of fields.
But first, to print just the first and second fields using gawk:

gawk -F ',' '{print $1, $2}' example.log

# -F tells gawk which character to use as the field separator

In example.log, the requested file is in the second-to-last column. On line 1 it is in field 6, while on line 2 it is in field 4.
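
Those positions are easy to check. The sketch below feeds the first two sample lines to gawk and prints, for each line, the line number (NR), the field count (NF), and the field that holds the request. The variable name positions and the fallback to plain awk are only assumptions for illustration; nothing here needs gawk-specific features.

```shell
# Fall back to plain awk if gawk is not installed
# (assumption: only -F, NR and NF are used, which any POSIX awk supports).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# First two lines of example.log, copied from the listing above.
positions=$(printf '%s\n' \
  '"",6667,"Rembau, NSembilan,Malaysia","GET /phpmyadmin ",404' \
  '"",80,"Selangor","GET /phpmyadmin/ ,200' |
  gawk -F ',' '{print "line " NR ": " NF " fields, request in field " (NF-1)}')

echo "$positions"
# line 1: 7 fields, request in field 6
# line 2: 5 fields, request in field 4
```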

In this case we can use gawk's NF variable. NF contains the number of fields in the current line. To get the second-to-last column, we can use $(NF-1) as below:

gawk -F ',' '{print $1, $(NF-1)}' example.log
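
To try this end to end, here is a minimal sketch that recreates example.log from the listing above and runs the command on it. The comma inside print makes awk put a space between the two values. The variable name requests and the fallback to plain awk are assumptions added for the example.

```shell
# Fall back to plain awk if gawk is not installed
# (assumption: -F and NF behave the same in any POSIX awk).
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Recreate the sample file; the quoted 'EOF' keeps the shell from touching the contents.
cat > example.log <<'EOF'
"",6667,"Rembau, NSembilan,Malaysia","GET /phpmyadmin ",404
"",80,"Selangor","GET /phpmyadmin/ ,200
"",9090,"A. Star, Kedah, Malysia","GET /phpmyadmin/favicon.ico,404
"",6667,"Malysia","GET /phpmyadmin/print.css,404
"",993,"A. Star, Kedah, Malysia","GET /phpmyadmin/phpmyadmin.css.php?lang=en-utf-8,404
EOF

# Print the first field and the second-to-last field of every line,
# no matter how many commas the location field contains.
requests=$(gawk -F ',' '{print $1, $(NF-1)}' example.log)
echo "$requests"
```

Because $(NF-1) is counted from the end of the line, the extra commas inside the location field no longer matter. The quotes around each value are still part of the output, since gawk only splits on the comma here.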

Hope that helps.
