Aggregate logs of Spark running on an EMR cluster with Fluent Bit

aws, spark, monitoring, fluentd, newrelic

When Spark jobs run in cluster mode, the logs are not written to the step/ directory, so they are hard to check on the console. Instead, try aggregating them into New Relic.

Launch an EMR cluster with AWS CLI and run Spark applications - sambaiz-net

Monitor infrastructure and applications with New Relic - sambaiz-net

Option 1. Sending logs with a self-installed Fluent Bit

Install Fluent Bit, a memory-saving alternative to Fluentd, and send logs with it.

Install and configure Fluent Bit

Install Fluent Bit in a bootstrap action, and place New Relic's output plugin.

Since Fluent Bit 1.9, the settings file can also be written in YAML, but that format is currently "not recommended for production", so I wrote it in the classic format this time.

In cluster mode, stdout/stderr logs are written to /mnt/var/log/hadoop-yarn/containers/<application_id>/<container_id>/(stdout|stderr) on the core node running the driver.

#!/bin/bash -e

sudo tee /etc/yum.repos.d/fluent-bit.repo << EOF > /dev/null
[fluent-bit]
name = Fluent Bit
baseurl = https://packages.fluentbit.io/amazonlinux/2/\$basearch/
gpgcheck=1
gpgkey=https://packages.fluentbit.io/fluentbit.key
enabled=1
EOF

sudo yum install -y fluent-bit-1.9.7-1

cd /etc/fluent-bit/

sudo wget https://github.com/newrelic/newrelic-fluent-bit-output/releases/download/v1.14.0/out_newrelic-linux-arm64-1.14.0.so
sudo tee plugins.conf << EOF > /dev/null
[PLUGINS]
    Path /etc/fluent-bit/out_newrelic-linux-arm64-1.14.0.so
EOF

sudo tee fluent-bit.conf << EOF > /dev/null
[SERVICE]
    plugins_file plugins.conf
    http_server  On
    http_listen  0.0.0.0
    http_port    2020

[INPUT]
    name tail
    path /mnt/var/log/hadoop-yarn/containers/*/*/stdout
    
[FILTER]
    name modify
    match *
    add emr_cluster_name $(aws emr list-clusters --query "Clusters[?Id=='$(sudo cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId")']" --region <region> | jq -r ".[0].Name")

[OUTPUT]
    name newrelic
    match *
    licenseKey \${NEW_RELIC_LICENSE_KEY}
EOF

sudo mkdir -p /lib/systemd/system/fluent-bit.service.d
sudo tee /lib/systemd/system/fluent-bit.service.d/newrelicenv.conf << EOF > /dev/null
[Service]
    Environment="NEW_RELIC_LICENSE_KEY=<license_key>"
EOF

# reload so systemd picks up the drop-in with the license key
sudo systemctl daemon-reload
sudo systemctl start fluent-bit
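The script above can then be passed as a bootstrap action when launching the cluster. A minimal sketch, assuming the script has been uploaded to an S3 bucket (the bucket name, release label, and instance type are placeholders; since the plugin downloaded above is the arm64 build, the nodes must be Graviton instances):

```shell
# Hypothetical launch example: <bucket> and the instance settings are assumptions.
aws emr create-cluster \
  --name fluent-bit-logging \
  --release-label emr-6.7.0 \
  --applications Name=Spark \
  --instance-type m6g.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://<bucket>/install-fluent-bit.sh
```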

Send Fluent Bit metrics

When launched with http_server On, the following metrics can be retrieved.

$ curl localhost:2020/api/v1/metrics | jq
{
  "input": {
    "tail.0": {
      "records": 0,
      "bytes": 0,
      "files_opened": 0,
      "files_closed": 0,
      "files_rotated": 0
    }
  },
  "filter": {},
  "output": {
    "newrelic.0": {
      "proc_records": 0,
      "proc_bytes": 0,
      "errors": 0,
      "retries": 0,
      "retries_failed": 0,
      "dropped_records": 0,
      "retried_records": 0
    }
  }
}
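To spot delivery failures quickly, the error counters can be pulled out with jq. A small sketch run against the sample response above; in practice, pipe `curl -s localhost:2020/api/v1/metrics` into the same filter:

```shell
# Sample response from above, embedded here so the filter runs standalone.
metrics='{"output":{"newrelic.0":{"errors":0,"retries_failed":0,"dropped_records":0}}}'

# Extract the counters that indicate logs failed to reach New Relic.
echo "$metrics" | jq -r '.output["newrelic.0"]
  | "errors=\(.errors) retries_failed=\(.retries_failed) dropped=\(.dropped_records)"'
# prints: errors=0 retries_failed=0 dropped=0
```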

Install the New Relic Infrastructure agent, and send the metrics with New Relic Flex.

#!/bin/bash -e

sudo curl -o /etc/yum.repos.d/newrelic-infra.repo https://download.newrelic.com/infrastructure_agent/linux/yum/amazonlinux/2/aarch64/newrelic-infra.repo
sudo yum -q makecache -y --disablerepo='*' --enablerepo='newrelic-infra'
sudo yum install newrelic-infra -y

NEW_RELIC_LICENSE_KEY=<license_key>
echo "license_key: ${NEW_RELIC_LICENSE_KEY}" | sudo tee -a /etc/newrelic-infra.yml
    
sudo tee /etc/newrelic-infra/integrations.d/fluentbit.yml << EOF > /dev/null
integrations:
- name: nri-flex
  config:
    name: fluentbit
    apis:
    - event_type: fluentbit
      url: http://localhost:2020/api/v1/metrics
EOF
    
sudo systemctl start newrelic-infra

Fetch the values with NRQL.

FROM fluentbit SELECT max(output.newrelic.0.errors) FACET tags.Name TIMESERIES 

Option 2. Sending logs with the Infrastructure agent's Fluent Bit

In fact, the Infrastructure agent also has a feature that sends logs with Fluent Bit: if settings are written in /etc/newrelic-infra/logging.d/logging.yml, the logs are sent. It can import ordinary Fluent Bit settings files, and an OUTPUT to New Relic is inserted dynamically.

sudo tee /etc/newrelic-infra/logging.d/fluentbit.conf << EOF > /dev/null
[SERVICE]
    plugins_file plugins.conf
    http_server  On
    http_listen  0.0.0.0
    http_port    2020

[INPUT]
    name tail
    path /mnt/var/log/hadoop-yarn/containers/*/*/stdout

[FILTER]
    name modify
    match *
    add emr_cluster_name $(aws emr list-clusters --query "Clusters[?Id=='$(sudo cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId")']" --region ap-northeast-1 | jq -r ".[0].Name")
EOF

sudo tee /etc/newrelic-infra/logging.d/logging.yml << EOF > /dev/null
logs:
  - name: fluentbit-import
    fluentbit:
      config_file: /etc/newrelic-infra/logging.d/fluentbit.conf
EOF
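After placing the files, restarting the agent should make it spawn its bundled Fluent Bit with the imported settings. A rough verification sketch (the process name may differ between agent versions):

```shell
# Restart the agent so it reads logging.d/ and launches Fluent Bit.
sudo systemctl restart newrelic-infra

# The agent runs Fluent Bit as a child process; confirm one is up.
# (The bracket trick keeps grep from matching its own process.)
ps -ef | grep '[f]luent-bit'
```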

References

hadoop - Where does YARN application logs get stored in EMR before sending to S3 - Stack Overflow