Troubleshooting ELK Syslog Performance

Summary

When running Logstash in large-scale environments, it can be quite difficult to troubleshoot performance, particularly when dealing with UDP packets.

The issue could occur at multiple layers. In order of dependency, the layers of concern are:

  • Infrastructure
  • Logstash Application
  • Pipeline

The following steps assume Logstash is installed on a Linux machine (CentOS 7.4), but similar steps can be used on other systems.

1. Troubleshooting Infrastructure

Issue: Communication issues from source

Diagnose:

  1. Dump all packets for a protocol and port (run on the OS hosting Logstash) to check whether you are receiving data. Replace ens160 with your interface name:
    tcpdump -i ens160 udp port 514
    
  2. If you are troubleshooting TCP traffic, you can telnet from the source to the destination port to isolate the issue. The example below is run from the source to diagnose traffic flow to port 514 on a Logstash host with IP 10.10.10.4:
    telnet 10.10.10.4 514
    

Fixes:

  1. Check all interim networking devices (firewalls, load balancers, switches, etc.) and ensure the traffic is getting through at every leg.

Issue: Dropped UDP Packets

Diagnose:

  1. View the network statistics (run on the OS hosting Logstash) to check whether your operating system is dropping packets:
    watch netstat -s --udp
    
    A good read on how to interpret the results of this command can be found here

Fixes:

  1. If there is packet loss, check the CPU of the nodes Logstash is shipping to; if they are running hot, downstream backpressure is the likely cause of the drops. Buffer tuning can also help (see the sketch after this list).

  2. Commercial only (X-Pack monitoring): check the pipeline via monitoring to identify where processing time is high.
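If the OS-level counters show drops under burst load, a common mitigation is enlarging the UDP receive buffers and giving the input more workers. The following is a minimal sketch, not a tested configuration; the port, buffer size, and worker count are illustrative values you should load-test:

    # Raise the kernel receive buffer ceiling (illustrative value)
    sysctl -w net.core.rmem_max=16777216

    input {
      udp {
        port => 1514
        workers => 4                       # illustrative worker count
        queue_size => 10000                # illustrative in-memory packet queue
        receive_buffer_bytes => 16777216   # match net.core.rmem_max
      }
    }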

2. Troubleshooting Logstash Application

Issue: Logstash keeps restarting

Diagnose:

  1. Print the journal of the service to see the errors:
    journalctl -u logstash.service
  2. Review the logs stored under /var/log/logstash/

Fix:

  1. The application may be trying to listen on port 514 with insufficient permissions; you can use iptables to redirect the traffic to an unprivileged port (a sketch follows this list). Discussion can be found here.
  2. Commercial Only (X-Pack security): The application may be failing to connect to the Elasticsearch nodes due to an incorrect certificate; check that the assigned CA is correct.
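As a sketch of the iptables approach from fix 1, assuming Logstash listens on unprivileged port 1514 while sources send to 514:

    # Redirect inbound syslog traffic from privileged port 514 to 1514
    iptables -t nat -A PREROUTING -p udp --dport 514 -j REDIRECT --to-port 1514
    iptables -t nat -A PREROUTING -p tcp --dport 514 -j REDIRECT --to-port 1514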

3. Troubleshooting Pipelines

Issue: Pipeline is not passing logs to Elasticsearch

Diagnose:

  1. Review the logs stored under /var/log/logstash/
  2. Review the pipeline to ensure the output uses the Elasticsearch output plugin, and add a stdout output to confirm logs are reaching the end of the pipeline
  3. Check the inputs to ensure the right port is bound

Fixes:

  1. Instead of using the syslog input, swap to the tcp/udp inputs to determine whether the input plugin is the problem (a minimal diagnostic pipeline follows this list)
  2. Check any drop {} calls in the filters; they may be discarding events you expect to see
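A minimal diagnostic pipeline for fix 1, assuming port 1514 is free for testing; the stdout output confirms events reach the end of the pipeline:

    input {
      udp { port => 1514 }
      tcp { port => 1514 }
    }
    output {
      stdout { codec => rubydebug }
    }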

The L in ELK+Docker Scale-out Logging

Warning: This article assumes a basic understanding of

  1. Docker
  2. Elasticsearch
  3. Logstash

Why Log to Elasticsearch?

Elasticsearch is a fantastic tool for logging, as it allows logs to be viewed as just another piece of time-series data. This is important for any organization’s journey through the evolution of data.

This evolution can be outlined as the following:

  1. Collection: Central collection of logs with required indexing
  2. Shallow Analysis: The real-time detection of specific data for event based actions
  3. Deep Analysis: The study of trends using ML/AI for pattern driven actions

Data that is not purposely collected for this journey will simply be bits wandering through the abyss of computing purgatory without a meaningful destiny! In this article we will discuss using Docker to scale out your Logstash deployment.

The Challenges

If you have ever used Logstash (LS) to push logs to Elasticsearch (ES), here are a number of challenges you may encounter:

  • Synchronization: of versions when upgrading ES and LS
  • Binding: to port 514 on a Linux host as it is a reserved port
  • High availability: with 100% always-on even during upgrades
  • Configuration management: across Logstash nodes

When looking at solutions, I evaluate the approach against:

  • Maintainability
  • Reliability
  • Scalability

The Solution

Using Docker, a generic infrastructure can be deployed because containers abstract away the underlying OS (aside from the difference between Windows and Linux hosts).

Docker solves the challenges inherent in the LS deployment:

  • Synchronization: Service parallelism is managed by Docker and new LS versions can be deployed just by editing the Docker Compose file
  • Port bindings: the Docker service can bind to port 514 on Linux hosts and forward it to the input port exposed by your pipeline
  • High availability: using replica settings, you can ensure that a certain number of LS instances will always be deployed as long as at least one host is available. This makes workload sizing after load testing purely a calculation, as follows:
  1. Max required logs per LS instance = (maximum estimated logs) / (N nodes)
  2. LS container size = LoadTestNode(max required logs per LS instance)
  3. Node size = LS container size × (1 + max N node loss)

For example, say you need to ingest 1M logs per day, with a requirement of 3 virtual machines and a tolerated loss of at most 1 virtual machine.

  1. Max logs per LS instance: 333,333 = 1M / 3
  2. Load testing shows this requires, say, 2 CPU and 4 GB RAM per container
  3. Node size: 4 CPU, 8 GB RAM = (2 CPU, 4 GB RAM) × (1 + 1)

Why not deploy 3 Logstash instances sized at 4 CPU and 8 GB RAM straight onto the OS?

  • Worker count does not scale automatically when you increase CPU and RAM, so even with larger machines you cannot guarantee that you will keep ingesting the same volume of logs in a “node down” situation
  • Upgrading the Logstash instances becomes an OS-level activity, with all the associated headaches
  • Restrictive auto-scaling capability
  • No central stdout view of the Logstash instances

Architecture

Let’s take a look at how this architecture looks:

Figure: Architecture with all nodes up

When a node goes down, the resulting environment looks like this:

Figure: Architecture after a node goes down

An added bonus of this deployment: if you want to ship the Logstash logs to Elasticsearch for central, real-time monitoring, it is as simple as adding Filebeat to the docker-compose file.

Figure: Filebeat added alongside Logstash

What does the docker-compose look like?

version: '3.3'

services:
  logstash:
    image: docker.elastic.co/logstash/logstash:6.4.0 # Change image to change LS version
    volumes:
      - /etc/logstash:/usr/share/logstash/config # Location of the config
    networks:
      - logstash
    ports:
      - 1514:1514/udp
      - 9600:9600
      - 1514:1514
      - 514:1514/udp # Map from host 514 to a defined pipeline for 1514
      - 514:1514
    deploy:
      mode: global
      placement:
        constraints:
          - node.role == manager
      update_config:
        parallelism: 1
        delay: 10s
      resources: # LS size
        limits:
          cpus: '1'
          memory: 50M
      restart_policy:
        condition: on-failure

networks:
  logstash:
    driver: overlay

The steps to implement the basic solution (without Filebeat):
  1. Estimate the virtual machine (VM) sizes and LS sizes based on estimated log ingestion and required redundancy
  2. Deploy the VMs and install Docker
  3. Create a Docker swarm
  4. Write a logstash.yml and either include a pipelines.yml or, if you are using X-Pack, configure centralized pipeline management in Elasticsearch
  5. Copy the config to the same file location on all VMs, OR create a shared file system (that is also HA) and store the files there to centrally manage config
  6. Start your Docker stack after customizing the docker-compose file shared in this article (example commands below)
  7. Set up an external load balancer pointing at all your virtual machines
  8. Enjoy yummy logs!
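For steps 3 and 6, the commands look like the following sketch, assuming the compose file is saved at /docker/docker-compose.yml (the stack name logging is an arbitrary choice):

    docker swarm init
    docker stack deploy -c /docker/docker-compose.yml logging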

The BUT!

As with most good things, there is a caveat: with Docker you add another layer of complexity. However, I would argue that because the Docker images for Logstash are managed and maintained by Elastic, it actually reduces the implementation headaches.

That said, I found one big issue with routing UDP traffic within Docker.

This issue will cause you to lose a proportion of your logs after container re-deployments!

  • What is your current logging deployment?
  • Have any questions or comments?
  • Are you interested in seeing an end-to-end video of this deployment?
  • Comment below!

 

Disclaimer: This article represents only my personal opinion and should not be considered professional advice. A healthy dose of skepticism is recommended.

C# Time-savers for Elasticsearch

Context

If you are currently developing in C# (particularly .NET Core 2.0+), here are some shortcuts that I hope will save you time I wish I could have back.

There is official documentation for C# Elasticsearch development; however, I found the examples quite lacking. I do recommend going through the documentation anyway, especially for the NEST client, as it is essential to understanding Elasticsearch with C#.

1. Low Level Client

“The low level client, ElasticLowLevelClient, is a low level, dependency free client that has no opinions about how you build and represent your requests and responses.”

ElasticSearch Official Documentation

Unfortunately, the low-level client in particular has very sparse documentation, especially examples. The following was discovered through Googling and painstaking testing.

1.1. Using JObjects in Elasticsearch

JObjects are a popular way to work with JSON objects in .NET, so you may need to pass JObjects through to Elasticsearch. This may be the result of one of the following:

  • The definition of the object is inherited from a different system and is only passed to Elasticsearch via your application (i.e. a micro-service)
  • You are too lazy to strongly define each object because it is unnecessary

JObject cannot be used as the generic type parameter for indexing, as you will receive this error:

Figure 1: ‘JObject’ Cannot be used as type parameter

Instead, use BytesResponse as the <T> type:

Figure 2: Using BytesResponse
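As a minimal sketch of the pattern, assuming Elasticsearch.Net 6.x; the index, type, and id values are illustrative:

using System;
using Elasticsearch.Net;
using Newtonsoft.Json.Linq;

var settings = new ConnectionConfiguration(new Uri("http://localhost:9200"));
var lowLevelClient = new ElasticLowLevelClient(settings);

// A JObject whose schema is owned by another system
JObject doc = JObject.Parse(@"{ ""message"": ""hello"" }");

// BytesResponse satisfies the response-type constraint that JObject cannot
var response = lowLevelClient.Index<BytesResponse>(
    "my-index", "_doc", "1", PostData.String(doc.ToString()));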

1.2. Running a “Bool” query

The Elasticsearch documentation does not give an example of a bool query using the low-level client. Why is the “bool” query particularly difficult? When writing the Query DSL in C#, bool automatically resolves to the built-in type and therefore throws an error:

Figure 3: bool Error

Not very anonymous-type friendly… the solution to this one is quite simple: add an ‘@’ character in front of bool.

Figure 4: Anonymous bool Fix
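A sketch of the same query in code, with an illustrative index and field; note the ‘@’ prefix and the anonymously typed array discussed in the next section:

var searchBody = PostData.Serializable(new
{
    query = new
    {
        @bool = new // '@' escapes the reserved word 'bool'
        {
            must = new object[]
            {
                new { match = new { message = "hello" } }
            }
        }
    }
});
var searchResponse = lowLevelClient.Search<BytesResponse>("my-index", searchBody);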

1.3. Defining Anonymous Arrays

This one seems a bit obvious, but if you want to define an array for use with the DSL, use an anonymously typed array, new object[] (an example can be seen in Figure 4 and in the sketch above).

1.4. Accessing nested fields in searches

Nested fields in Elasticsearch are addressed by their full path as a ‘.’-delimited string. This creates a problem when trying to query such a field specifically, because the dotted name is not a valid property name in an anonymous type.

Figure 5: Nested Field Error

The solution is to define a Dictionary and use the dictionary in the anonymous type.

Figure 6: Nested Field Fix

The Dictionary can be embedded in the anonymous type and will successfully query the nested field in Elasticsearch.
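A sketch of the workaround, assuming a nested field addressed as user.name:

using System.Collections.Generic;

var nestedBody = PostData.Serializable(new
{
    query = new
    {
        // '.' is illegal in an anonymous-type property name,
        // so the dotted path goes into a Dictionary key instead
        match = new Dictionary<string, object>
        {
            { "user.name", "alice" }
        }
    }
});
var nestedResponse = lowLevelClient.Search<BytesResponse>("my-index", nestedBody);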

2. NEST Client

“The high level client, ElasticClient, provides a strongly typed query DSL that maps one-to-one with the Elasticsearch query DSL.”

ElasticSearch Official Documentation

The NEST documentation is much more comprehensive; the only issue I found was with keyword Term searches.

2.1. Using Keyword Fields

All string fields are mapped by default to both text and keyword; the documentation can be found here. The issue is that the strongly typed object used in the Elastic mapping has no “.keyword” property to reference, so an error is thrown.

Example:

For the Object:

public class SampleObject
{
    public string TextField { get; set; }
}

Searching would look like this:

Figure 7: Keyword Field Error

Unfortunately, the .Keyword property does not exist; the solution is to use the .Suffix function with property name inference. This is covered in the docs, however it is not immediately apparent that this is how you access “keyword”.

Figure 8: Keyword Fix
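A sketch of the fix, assuming an ElasticClient named client and the SampleObject above; Suffix("keyword") targets the keyword sub-field through property name inference:

var response = client.Search<SampleObject>(s => s
    .Query(q => q
        .Term(t => t
            .Field(f => f.TextField.Suffix("keyword")) // resolves to textField.keyword
            .Value("exact value"))));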

I hope this post was helpful and saved you some time. If you have any tips of your own, please comment below!

 

Deploying a SSL Protected Containerized App: Part 3

Checklist

Let’s quickly do a checklist of what we have so far:

  1. SSH Accessible Virtual Machine (Running CentOS 7.4)
  2. Ports 22, 443, 80 are open on the virtual machine
  3. Domain pointed at the public IP of the Virtual machine
  4. SSL Certificate generated on the virtual machine
  5. Docker CE installed on the virtual machine

If you have not completed the steps above, review part 1 and part 2.

Deploying the Final Stack

SSH into the virtual machine and swap to the root user.

Move to the root directory of the machine (run cd /)

Creating our directories

Create two directories (this is done for simplicity):

  • certs – This will be used to store the SSL certificates used by our NGINX container

mkdir /certs

  • docker – This will be used to store our Docker-related files (docker-compose.yml)

mkdir /docker

Swap to the docker directory

cd /docker

We will create a docker-compose file in this directory shortly (it is case and space sensitive; read more about Docker Compose).

Moving and renaming our SSL Certificates

Unfortunately, nginx-proxy must read the SSL certificate as <domain name>.crt and the key as <domain name>.key, so we need to copy and rename the certificates originally generated for our domain.

Run the following commands to copy the certificates to the relevant folder and rename them:

cp /etc/letsencrypt/live/<your domain>/fullchain.pem /certs/<your domain>.crt

cp /etc/letsencrypt/live/<your domain>/privkey.pem /certs/<your domain>.key

Creating a docker-compose.yml file

The docker compose file will dictate our stack.

Run the following command to create the file at /docker/docker-compose.yml:

vi /docker/docker-compose.yml

Populate the file with the following content; the inline comments explain each line:

version: "3.3"
services:  
  nginx-proxy:
    image: jwilder/nginx-proxy #nginx proxy image
    ports:
      - "443:443"  #binding the host port 443 to container 443 port
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock:ro      
      - /certs:/etc/nginx/certs #Mounting the SSL certificates to the image 
    networks:
     -  webnet  
  visualizer:
    image: dockersamples/visualizer:stable 
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"   
    environment:
      - VIRTUAL_HOST=<Your DOMAIN ie. domain.com.au>
networks:      
  - webnet

environment:

– VIRTUAL_HOST=<your domain ie. Domain.com.au>

networks:

webnet:

Save the file by pressing Esc, then typing :wq

Starting the stack

Start docker

systemctl start docker

Pull the images

docker pull jwilder/nginx-proxy:latest

docker pull dockersamples/visualizer

Start the swarm

docker swarm init

Deploy the stack

docker stack deploy -c /docker/docker-compose.yml test-stack

Congratulations! If you have done everything right, you should now have an SSL-protected visualizer when you browse to https://<your domain>

Figure 1: Final Stack Visualization

Troubleshooting

To troubleshoot any problems, check that all services have a running container by running:

docker service ls

Figure 2: Example of troubleshooting output

Check the replica counts. If the nginx service is not running, check that the mounted /certs path exists.

If the nginx container is running, you can run

docker service logs --follow <service ID>

then try accessing https://<your domain> and see whether the connection is coming through.

  • If it is, then check the environment variable in your docker-compose file
  • If it is not, then check that port 443 is open and troubleshoot connectivity to the server

Deploying a SSL Protected Containerized App: Part 2

Checklist

Let’s quickly do a checklist of what we have so far:

  1. SSH Accessible Virtual Machine (Running CentOS 7.4)
  2. Ports 22, 443, 80 are open on the virtual machine
  3. Domain pointed at the public IP of the Virtual machine

If you have not done these things, you can deploy your virtual machine following the steps in part 1.

Preparing the Host

Start this part by initializing an SSH session into the virtual machine.

Swap to the root user by running

su root

Installing Docker

Install docker

On the virtual machine that you have deployed run the following commands:

sudo yum install -y yum-utils device-mapper-persistent-data lvm2

sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

sudo yum install docker-ce

Note: These are the quick commands to install docker, for more information as to what they do exactly visit the docs.

Downloading CertBot

Certbot is a nifty client that fetches SSL/TLS certificates and is used as the client for Let’s Encrypt.

Download Certbot

Pre-requisites:

yum -y install yum-utils
yum install epel-release

Run installation:

sudo yum install certbot

Note: These are the quick commands to install certbot, for more information as to what they do exactly visit the docs.

Generating a SSL Certificate

On the virtual machine that you have deployed run the following commands:

When running certbot to obtain an SSL certificate, too many attempts will result in a lockout of the domain for up to an hour. To prevent a lockout, we will first test creation of the certificate with the --staging flag.

sudo certbot certonly --staging

Run through the prompts and at the very end enter your domain address (e.g. domain.com.au).

The successful output is shown below

Figure: Successful output of the staging certificate request

Once you can confirm that a staging certificate can be generated, run the process again without the --staging flag.
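That is, the same command without the flag:

sudo certbot certonly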

Once you have completed the deployment of a production ready SSL certificate, you can now move on to part 3.