Loading CSV data into Elasticsearch with Logstash

Updated: Aug 19, 2020


Refer to my previous blogs (Linux | Mac users) to install the ELK stack on your machine. Once the installation is done, there are a couple of ways I am aware of to load a CSV file into Elasticsearch (one via Logstash and another with Filebeat). In this blog we will be using Logstash to load the file.


I am using sample Squid access logs (a comma-separated CSV file) to illustrate this blog. You can find the file format details at this link.


 

Sample Data


Go ahead and copy-paste these records to create the access_log.csv file. I picked this format because CSV is a widely used format in production environments and the file contains various types of attributes (data types).


$ more /Volumes/MYLAB/testdata/access_log.csv
Time,Duration,Client_address,Result_code,Bytes,Request_method,Requested_URL,User,Hierarchy_code,Type
1121587707.473,60439,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.44:25,-,DIRECT/203.84.194.44,-
1121587709.448,61427,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.50:25,-,DIRECT/203.84.194.50,-
1121587709.448,61276,219.138.188.61,TCP_MISS/503,0,CONNECT,67.28.114.36:25,-,DIRECT/67.28.114.36,-
1121587709.449,60148,219.138.188.61,TCP_MISS/503,0,CONNECT,4.79.181.12:25,-,DIRECT/4.79.181.12,-
1121587710.889,60778,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.39:25,-,DIRECT/203.84.194.39,-
1121587714.803,60248,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.50:25,-,DIRECT/203.84.194.50,-
1121587714.803,59866,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.43:25,-,DIRECT/203.84.194.43,-
1121587719.834,60068,219.138.188.61,TCP_MISS/503,0,CONNECT,203.84.194.45:25,-,DIRECT/203.84.194.45,-
1121587728.564,59642,219.138.188.55,TCP_MISS/503,0,CONNECT,168.95.5.45:25,-,DIRECT/168.95.5.45,-
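As a quick sanity check after saving, you can count the lines; with the header row plus nine records you should see 10:

$ wc -l /Volumes/MYLAB/testdata/access_log.csv
      10 /Volumes/MYLAB/testdata/access_log.csv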


 




Start Elasticsearch & Kibana


Now start Elasticsearch and Kibana (if you don't remember how to start them, refer to my previous blogs Linux | Mac users). Don't kill these processes; both are required by Logstash to load the data.


$ elasticsearch



$ kibana
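Before moving on, you can optionally confirm from another terminal that Elasticsearch is up. A quick check, assuming the default port 9200 on localhost:

$ curl 'http://localhost:9200/?pretty'    # should return the cluster name and version info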


 

Logstash Configuration


To read a CSV file with Logstash, you need to create a configuration file which holds all the configuration details for the access log file: input, filter & output.


In short,

  • The input tag contains details like the filename, location, start position, etc.

  • The filter tag contains the file type, separator, column details, transformations, etc.

  • The output tag contains the host details where the data will be written, the index name (should be in lower case), the document type, etc.

These tags look like JSON, but they are not; the format is specific to Logstash. I have created a config file under the config directory in Logstash, as shown below.


$ more /usr/local/var/homebrew/linked/logstash-full/libexec/config/logstash_accesslog.config


input {
  file {
    path => "/Volumes/MYLAB/testdata/access_log.csv"
    start_position => "beginning"                           # read the file from the top
    sincedb_path => "/Volumes/MYLAB/testdata/logstash.txt"  # tracks how far the file has been read
  }
}

filter {
  csv {
    separator => ","
    columns => [ "Time", "Duration", "Client_address", "Result_code", "Bytes", "Request_method", "Requested_URL", "User", "Hierarchy_code", "Type" ]
  }
  # convert the UNIX epoch in "Time" into a proper timestamp field
  date {
    match => [ "Time", "UNIX" ]
    target => "EventTime"
  }
  # cast numeric fields from string to integer
  mutate { convert => ["Duration", "integer"] }
  mutate { convert => ["Bytes", "integer"] }
}

output {
  elasticsearch {
    hosts => "localhost"
    index => "logstash-accesslog"
  }
  stdout {}    # also print each event to the console
}
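For reference, each parsed event printed by stdout (and indexed into Elasticsearch) should look roughly like this. This is my sketch based on the first record; the exact formatting and extra metadata fields (@timestamp, @version, host, message, path) depend on your Logstash version:

{
              "Time" => "1121587707.473",
         "EventTime" => 2005-07-17T08:08:27.473Z,
          "Duration" => 60439,
    "Client_address" => "219.138.188.61",
       "Result_code" => "TCP_MISS/503",
             "Bytes" => 0,
    "Request_method" => "CONNECT",
     "Requested_URL" => "203.84.194.44:25",
              "User" => "-",
    "Hierarchy_code" => "DIRECT/203.84.194.44",
              "Type" => "-"
}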

 

Explanation!


These tags are very basic and straightforward.

  • Use the columns tag to define the list of fields within double quotes (if you face issues, use single quotes instead).

  • "mutate" does minor data type conversion, casting Duration and Bytes from strings to integers.

  • "match" converts the UNIX timestamp in Time to a human-readable time format, stored in the EventTime field.

  • Further, logstash-accesslog is the index name which I am using for the Squid access logs.

  • If you still have any questions regarding any of the tags, please comment (in the comment section below) and I will get back to you as soon as possible.

  • Also, you can change sincedb_path to /dev/null if you don't want to keep track of loaded files.

  • If you want to reload the same file again, make sure you delete its entry from the sincedb_path file (logstash.txt in this case). Both options are sketched below.
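Here is a minimal sketch of those two sincedb options (paths are the ones used in this example; adjust them to your setup):

# Option 1: point sincedb at /dev/null so no read position is saved,
# and the file is re-read from the beginning on every Logstash run
sincedb_path => "/dev/null"

# Option 2: keep the sincedb file, but delete it before re-running
# to force a full reload of access_log.csv
$ rm /Volumes/MYLAB/testdata/logstash.txt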




 

Run Logstash & Load data


You are all set to start Logstash with the configuration file we just created. Run the command below; it will take a few seconds to index all the records.


Change your Logstash home location accordingly; mine is under Homebrew as I am using a Mac.


$ /usr/local/var/homebrew/linked/logstash-full/bin/logstash -f /usr/local/var/homebrew/linked/logstash-full/libexec/config/logstash_accesslog.config
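Once Logstash finishes, you can verify the load from another terminal. A quick check, again assuming Elasticsearch on localhost:9200 (note that with this simple config the header row is indexed as an event too, so the count includes it):

$ curl 'http://localhost:9200/_cat/indices/logstash-accesslog?v'
$ curl 'http://localhost:9200/logstash-accesslog/_count?pretty'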