Sunday, November 4, 2018

Highly-Available and Load-Balanced Logstash

The Challenge

When using the Elastic Stack, I've found that Elasticsearch and Beats are great at load balancing, but Logstash... not so much, as it does not support clustering. The issue arises when you have end devices that can't run a Beats agent (which could otherwise send to two or more Logstash servers). To get around this, you would typically:

  • Set up any one of the Logstash servers as the syslog/event destination
    • Pro: Only one copy of the data to maintain
    • Con: What if that server or Logstash input goes down?
  • Set up multiple Logstash servers as the syslog/event destinations
    • Pro: More likely to receive the logs during a Logstash server or input outage
    • Con: Duplicate copies of the logs to deal with
A third option, which I've developed and laid out below, combines the pros of both options above without any of the cons, providing a highly-available and load-balanced Logstash implementation. This solution is highly scalable as well. Let's get started.

Prerequisites

To create this proof-of-concept solution, I started with a very minimal configuration:
  • Two virtual machines within the same layer 2 domain (inside VMware Fusion)
    • CentOS 7 64-bit
    • Logstash 6.4.2
    • Java
    • Keepalived
    • IP Virtual Server (ipvsadm)
  • Host machine to generate some traffic (which will generate sample logs)
    • Mac OSX
    • nc

Log Server Configuration

OS install


For this, I simply created a small VMware Fusion virtual machine using the CentOS 7 Minimal ISO as my installation source (this one in particular). The rest of the machine creation is pretty straight-forward. (Note: I did change from NAT to Wi-Fi networking as I was having very strange issues with NAT networking)




After starting the virtual machine, the install process will begin. This is where you can just do a basic install, but I chose a few options that hit close to home with my day job:
  • Partition the disk manually if you intend to apply a security policy (automatic partitioning would otherwise cause a security policy violation that keeps you from proceeding)

  • Configure static addressing (my Wi-Fi network within Fusion is 192.168.1.0/24 with a 192.168.1.1 gateway)
  • Apply the DISA STIG for CentOS Linux 7 because... security.
  • Don't forget to set the root password and create an administrative user. Without this, you'll have a hard time logging in (especially via SSH... given this security policy)


Application Install


From here, let the machine reboot and SSH in (it's a much better experience than using the console via Fusion, in my opinion). Some packages can now be added.
  • First, the Logstash and Load Balancing pre-requisite applications:
    • sudo yum -y install java tcpdump ipvsadm keepalived
  • Next, install Logstash per Elastic's best practices:
    • sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
    • sudo vi /etc/yum.repos.d/logstash.repo

      [logstash-6.x]
      name=Elastic repository for 6.x packages
      baseurl=https://artifacts.elastic.co/packages/6.x/yum
      gpgcheck=1
      gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
      enabled=1
      autorefresh=1
      type=rpm-md
    • sudo yum -y install logstash

    Logstash configuration


    There's no way I could show off all of the possible Logstash configurations (that's some research for you :) ), so I'll just set up a simple one for testing our highly-available Logstash cluster:
    • This is a bit different from a typical install: the API will need to be exposed beyond localhost so the keepalived health checks (covered below) can reach it from either server:
      • sudo vi /etc/logstash/logstash.yml
        • Uncomment http.host and set it to the server's IP address
        • Uncomment http.port and set it to just 9600 (example settings shown after this list)
    • The input and output configuration for Logstash is next (you can change the filename to something else... unless you agree). For this testing, I'm just setting up a raw UDP listener on port 5514 and writing to a file in /tmp.
      • sudo vi /etc/logstash/conf.d/ryanisawesome.conf
        input {
          udp {
            host => "192.168.1.210" # server's IP
            port => 5514
            id => "udp-5514"
          }
        }

        output {
          file {
            path => "/tmp/itworked.txt"
            codec => json_lines
          }
        }
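    For reference, after those two changes the relevant lines in /etc/logstash/logstash.yml on the first server would look something like this (assuming 192.168.1.210 is that server's address, matching the input config above):

      http.host: "192.168.1.210"
      http.port: 9600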

    SELinux Tweaks


    There are a few settings that need to be changed to allow keepalived and ipvsadm to work properly.
    • Set the nis_enabled SELinux boolean to allow keepalived to call scripts which will access the network
      • sudo setsebool -P nis_enabled=1
    • Allow IP forwarding and binding to a nonlocal IP address
      • sudo vi /etc/sysctl.conf
        net.ipv4.ip_forward = 1
        net.ipv4.ip_nonlocal_bind = 1
        • If you chose the DISA STIG Policy during the VM build, comment out "net.ipv4.ip_forward = 0" (yes... this is a finding if this system is not a router. But once ipvsadm is running it IS a router. So we're all good ;) )
      • sudo sysctl -p
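    To confirm the new values actually took hold, you can query them directly (both should report 1):
    • sudo sysctl net.ipv4.ip_forward net.ipv4.ip_nonlocal_bind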

    Keepalived


    Here's where the real bread-and-butter of this setup lies: keepalived. This application is typically used to provide a virtual IP shared between two or more servers. If the primary server goes down, the second (backup) server picks up the IP to avoid any substantial downtime. That's not a bad solution in terms of high availability, but it means only one server is actually receiving and processing our logs at any given time. We can do better.

    Another feature of keepalived is virtual servers (virtual_server). With this, you can configure a listening port on the virtual IP and, when data is received, keepalived (through IPVS) will forward it to a pool of real servers using the load-balancing method of your choosing. The configuration looks something like this:
    • sudo vi /etc/keepalived/keepalived.conf
      # Global Configuration
      global_defs {
        notification_email {
          notification@domain.org
        }
        notification_email_from keepalived@domain.org
        smtp_server localhost
        smtp_connect_timeout 30
        router_id LVS_MASTER
      }

      # describe virtual service ip
      vrrp_instance VI_1 {
        # initial state
        state MASTER
        interface ens33
        # arbitrary unique number 0..255
        # used to differentiate multiple instances of vrrpd
        virtual_router_id 1
        # for electing MASTER, highest priority wins.
        # to be MASTER, make 50 more than other machines.
        priority 100
        # unicast VRRP between the two Logstash servers
        # (swap these two values on the second server, as described later)
        unicast_src_ip 192.168.1.210
        unicast_peer {
          192.168.1.220
        }
        authentication {
          auth_type PASS
          auth_pass secret42
        }
        virtual_ipaddress {
          192.168.1.230/24
        }
      }

      # describe virtual Logstash server
      virtual_server 192.168.1.230 5514 {
        # run the health checks every 5 seconds
        delay_loop 5
        # round-robin load balancing
        lb_algo rr
        # forward via NAT
        lb_kind NAT
        # one-packet-scheduling: balance each UDP datagram individually
        ops
        protocol UDP

        real_server 192.168.1.210 5514 {
          MISC_CHECK {
            misc_path "/bin/python /etc/keepalived/inputstatus.py 192.168.1.210 udp-5514"
          }
        }
        real_server 192.168.1.220 5514 {
          MISC_CHECK {
            misc_path "/bin/python /etc/keepalived/inputstatus.py 192.168.1.220 udp-5514"
          }
        }
      }

    Logstash Health Checks


    You'll probably notice a reference to inputstatus.py in the above configuration. Keepalived needs to run an external script to determine whether or not the configured "real server" is eligible to receive the data. This is typically pretty easy to do with TCP... if a SYN, SYN/ACK, ACK is successful, we can assume the service is listening. That's not an option with a Logstash UDP input, as nothing is sent back to confirm that the service is listening. What can be used instead is the Logstash API. The following script simply makes an API call to list the node's stats, parses the resulting list of inputs, and, if the input we're looking for is up, exits normally.


    • sudo vi /etc/keepalived/inputstatus.py
      #!/bin/python
      import sys
      import urllib2
      import json

      # expect two arguments: the Logstash server's IP and the input id to check
      if len(sys.argv) != 3:
          print "Usage: inputstatus.py IP input-id"
          exit(1)

      # pull the node stats from the Logstash API and grab the list of running inputs
      res = urllib2.urlopen('http://' + sys.argv[1] + ':9600/_node/stats').read()
      inputs = json.loads(res)['pipelines']['main']['plugins']['inputs']

      match = False

      # look for the requested input id among the running inputs
      for input in inputs:
          if sys.argv[2] == input['id']:
              match = True

      # exit 0 (healthy) if the input is up; anything else tells keepalived
      # to pull this real server out of the pool
      if match:
          exit(0)
      else:
          exit(1)
    Keepalived will add this server to the list of real servers if the exit code of our script is 0 and remove it from the list if it is anything except 0. The keepalived configuration above is set up to run this check every 5 seconds (delay_loop 5) to keep log loss to a minimum if an input goes down. Adjust as you see fit (i.e., based on how much loss you can acceptably handle).

    Of course, you would have to define several of these checks if you have Logstash listening on multiple ports, but cut and paste is easy. Just look at /var/log/messages to ensure that these checks are exiting properly. If you see a line like "Oct 30 09:44:58 stash1 Keepalived_healthcheckers[16141]: pid 16925 exited with status 1", either the script failed or a particular input is not up. Since this error message isn't the most descriptive, you'll have to manually test or view each input on each host to see which one it is. You can manually test the Logstash inputs (once that service is running) by issuing:

    • /bin/python /etc/keepalived/inputstatus.py <IP> <input-id>
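    If you want to see exactly what the script is parsing, you can also hit the stats API directly and look for your input's id under pipelines -> main -> plugins -> inputs. A heavily trimmed example (your output will be much longer):
    • curl -s http://192.168.1.210:9600/_node/stats | python -m json.tool
      {
        ...
        "pipelines": {
          "main": {
            "plugins": {
              "inputs": [
                { "id": "udp-5514", ... }
              ],
              ...
            }
          }
        }
      }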

    Firewall Rules


    Sure, we could just disable firewalld... but we did just expose our API to anything that can reach this machine, so we need to lock this down a bit better. Don't worry, the rules are pretty straight-forward. (Note: replace '192.168.1.111' with your host which is sending logs to Logstash and '192.168.1.210', '192.168.1.220', and '192.168.1.230' with the two Logstash servers and virtual IP address, in that order).
    • sudo firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.1.210/32 protocol value=vrrp accept' 
    • sudo firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.1.220/32 protocol value=vrrp accept'
    • sudo firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.1.210/32 destination address=192.168.1.220/32 port port=9600 protocol=tcp accept'
    • sudo firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.1.220/32 destination address=192.168.1.210/32 port port=9600 protocol=tcp accept'
    • sudo firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.1.111/32 destination address=192.168.1.230/32 port port=5514 protocol=udp accept'
    • sudo firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.1.111/32 destination address=192.168.1.210/32 port port=5514 protocol=udp accept'
    • sudo firewall-cmd --permanent --add-rich-rule='rule family=ipv4 source address=192.168.1.111/32 destination address=192.168.1.220/32 port port=5514 protocol=udp accept'
    • sudo firewall-cmd --reload
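    Once reloaded, you can double-check that the rules were applied with:
    • sudo firewall-cmd --list-rich-rules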

    The Second Logstash server


    Shut down the Logstash server virtual machine; it's much easier to clone this one and make a few configuration changes than to step through this process all over again.

    Now that it's shut down, clone the virtual machine within VMware Fusion.


    Boot the second one up (leaving the first powered off for now) and make the following changes in the VM console:
    • Set hostname
      • sudo hostnamectl set-hostname stash2
    • Set IP address
      • sudo vi /etc/sysconfig/network-scripts/ifcfg-<interface>
        • Change IPADDR to appropriate IP address
      • sudo systemctl restart network
    • Change Logstash listening IPs
      • sudo vi /etc/logstash/logstash.yml
        • Change http.host to stash2's IP address
      • sudo vi /etc/logstash/conf.d/ryanisawesome.conf
        • Change host to stash2's IP address
    • Swap the unicast_src_ip and unicast_peer IP addresses 
      • sudo vi /etc/keepalived/keepalived.conf
    • Reboot
      • sudo reboot now
    Now you should be able to start the original virtual machine back up (in my case, stash1).

    Putting It All Together


    We've finally reached the point to fire up all the services and test out the HA Logstash configuration. On each Logstash VM:

    • sudo systemctl enable logstash
    • sudo systemctl start logstash
    • sudo systemctl enable keepalived
    • sudo systemctl start keepalived
    You can monitor that Logstash is up by viewing the output of:
    • sudo ss -nltp | grep 9600
    If you have no output, it's not up yet. If it doesn't come up after a few minutes, check out /var/log/logstash/logstash-plain.log for any error messages. Personally, I like to "tail -f" this file (command below) right after starting Logstash to ensure everything is working properly (plus it looks cool to those who look over your shoulder as all that nerdy text flies by).
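    • sudo tail -f /var/log/logstash/logstash-plain.log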

    On each machine, you can now check that ipvsadm and keepalived are configured properly and playing nice together. You should be able to run the following commands and get similar output (your IPs may differ, but you should see TWO real servers in the ipvsadm output):
    • ip a
      • Only ONE of the two servers should have the virtual IP assigned (by default, the one with the higher IP address since the priority is the same and this is the tie-breaker when using VRRP)
    • sudo ipvsadm -ln
      IP Virtual Server version 1.2.1 (size=4096)
      Prot LocalAddress:Port Scheduler Flags
        -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
      UDP  192.168.1.230:5514 rr ops
        -> 192.168.1.210:5514           Masq    1      0          0       
        -> 192.168.1.220:5514           Masq    1      0          0 
    To test that load balancing is happening, the sample log source (in my case, my host operating system) will need to send some data over UDP 5514 to the virtual IP address. To do this, I'm going to use netcat (but really anything that can send data manually over UDP will work... including PowerShell). 
    • for i in $(seq 1 4); do echo "testing..." | nc -u -w 1 192.168.1.230 5514; done
    What I just did was send four test messages to the virtual IP. If everything worked properly, the virtual server will have received the messages and load-balanced them, in a round-robin fashion, across the two real servers, with each message written to that server's /tmp/itworked.txt file. Let's check it out on each server.
    • cat /tmp/itworked.txt
      {"host":"192.168.1.111","@timestamp":"2018-11-04T17:34:37.065Z","message":"testing...\n","@version":"1"}
      {"host":"192.168.1.111","@timestamp":"2018-11-04T17:34:39.038Z","message":"testing...\n","@version":"1"}
    Success! Both servers received two messages!
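    As a final high-availability test, you can take one of the Logstash servers out of the pool and confirm the logs keep flowing (this uses the same commands from above):
    • On one of the Logstash servers: sudo systemctl stop logstash
    • After a few seconds (the delay_loop interval), sudo ipvsadm -ln on the other server should show only one real server remaining
    • From the log source, re-run the nc loop; all of the new messages should land in the surviving server's /tmp/itworked.txt
    • Start Logstash back up (sudo systemctl start logstash) and that server will rejoin the pool once its health check passes again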