Loading CSV data into Elasticsearch with Logstash

Updated: Dec 2, 2022

Refer to my previous blogs (Linux | Mac users) to install the ELK stack on your machine. Once the installation is done, there are a couple of ways to load CSV files into Elasticsearch that I am aware of: one via Logstash and another via Filebeat. In this blog, we will be using Logstash to load the file.

I am using sample Squid access logs (a comma-separated CSV file) for this blog. You can find the file format details at this link.

Sample Data

Copy and paste these records to create an access_log.csv file. I picked this format because CSV is widely used in production environments and the file contains attributes of various data types.

$ more /Volumes/MYLAB/testdata/access_log.csv
Time,Duration,Client_address,Result_code,Bytes,Request_method,Requested_URL,User,Hierarchy_code,Type
1121587707.473,60439,,TCP_MISS/503,0,CONNECT,,-,DIRECT/,-
1121587709.448,61427,,TCP_MISS/503,0,CONNECT,,-,DIRECT/,-
1121587709.448,61276,,TCP_MISS/503,0,CONNECT,,-,DIRECT/,-
1121587709.449,60148,,TCP_MISS/503,0,CONNECT,,-,DIRECT/,-
1121587710.889,60778,,TCP_MISS/503,0,CONNECT,,-,DIRECT/,-
1121587714.803,60248,,TCP_MISS/503,0,CONNECT,,-,DIRECT/,-
1121587714.803,59866,,TCP_MISS/503,0,CONNECT,,-,DIRECT/,-
1121587719.834,60068,,TCP_MISS/503,0,CONNECT,,-,DIRECT/,-
1121587728.564,59642,,TCP_MISS/503,0,CONNECT,,-,DIRECT/,-
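Before wiring up Logstash, you can sanity-check how the header maps onto a record with a quick Python sketch (this is just for illustration and not part of the Logstash setup; the inlined sample lines come from the file above):

```python
import csv
import io

# Header plus one record from access_log.csv
sample = (
    "Time,Duration,Client_address,Result_code,Bytes,Request_method,"
    "Requested_URL,User,Hierarchy_code,Type\n"
    "1121587707.473,60439,,TCP_MISS/503,0,CONNECT,,-,DIRECT/,-\n"
)

# DictReader maps each value to its header column -- the same mapping
# the Logstash csv filter performs via its `columns` setting
reader = csv.DictReader(io.StringIO(sample))
row = next(reader)
print(row["Time"])            # 1121587707.473
print(row["Result_code"])     # TCP_MISS/503
print(row["Client_address"])  # empty string -- the field between the commas is blank
```

Note that every value comes out as a string, including `Duration` and `Bytes`; that is exactly why the config below needs the mutate conversions.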

Start Elasticsearch and Kibana

Now start Elasticsearch and Kibana (if you don't remember how to start them, refer to my previous blogs Linux | Mac users). Don't kill these processes; both are required by Logstash to load the data.

$ elasticsearch

$ kibana

Logstash Configuration

In order to read CSV files with Logstash, you need to create a configuration file that defines how to access the log file: the input, filter, and output stages.

In short,

  • The input tag contains details like filename, location, start position, etc.

  • The filter tag contains the file type, separator, column details, transformations, etc.

  • The output tag contains the host details where the data will be written, the index name (should be in lower case), document type, etc.

These tags look like JSON, but they are not; the format is specific to Logstash. I have created a config file under the config directory in Logstash as shown below.

$ more /usr/local/var/homebrew/linked/logstash-full/libexec/config/logstash_accesslog.config

input {
  file {
    path => "/Volumes/MYLAB/testdata/access_log.csv"
    start_position => "beginning"
    sincedb_path => "/Volumes/MYLAB/testdata/logstash.txt"
  }
}

filter {
  csv {
    separator => ","
    columns => [ "Time", "Duration", "Client_address", "Result_code", "Bytes", "Request_method", "Requested_URL", "User", "Hierarchy_code", "Type" ]
  }
  date {
    match => [ "Time", "UNIX" ]
    target => "EventTime"
  }
  mutate { convert => ["Duration", "integer"] }
  mutate { convert => ["Bytes", "integer"] }
}

output {
  elasticsearch {
    hosts => "localhost"
    index => "logstash-accesslog"
  }
  stdout {}
}


These tags are quite basic and straightforward.

  • You use the columns setting to define the list of fields within double quotes (if you face issues, use single quotes instead).

  • "mutate" performs minor data type conversions (here, strings to integers).

  • Also, "match" in the date filter is used to convert the UNIX timestamps to a human-readable time format.

  • Further, logstash-accesslog is the index name that I am using for the Squid access logs.

  • Still, if there is any question regarding any of the tags please comment (in the comment section below) and I will get back to you as soon as possible.

  • Also, you can change sincedb_path to /dev/null if you don't want to keep track of loaded files.

  • If you want to reload the same file again make sure you delete the entry from the sincedb_path file (logstash.txt in this case).
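The csv, mutate, and date filters above can be approximated in plain Python to see what happens to one event (a rough sketch for illustration only; Logstash's actual pipeline works differently, and the `event` dict here is just a stand-in for a parsed record):

```python
from datetime import datetime, timezone

# One parsed record as the csv filter would emit it: every field is a string
event = {"Time": "1121587707.473", "Duration": "60439", "Bytes": "0"}

# mutate { convert => [...] }: cast the numeric fields to integers
event["Duration"] = int(event["Duration"])
event["Bytes"] = int(event["Bytes"])

# date { match => ["Time", "UNIX"], target => "EventTime" }:
# interpret Time as a UNIX epoch and store a readable timestamp
event["EventTime"] = datetime.fromtimestamp(float(event["Time"]), tz=timezone.utc)

print(event["Duration"] + 1)  # 60440 -- arithmetic works now that it's an int
print(event["EventTime"])     # 2005-07-17 08:08:27.473000+00:00
```

Without the mutate conversions, `Duration` and `Bytes` would be indexed as strings, and range queries or aggregations on them in Kibana would not behave as expected.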
