AWS CloudWatch and Telegraf metrics collection comparison
We analyze performance issues and performance metrics for our customers every day, so we need to understand the limitations, capabilities and comparative advantages of the various hardware monitoring and metrics collection tools.

This short review compares Telegraf (an awesome open source solution, https://github.com/influxdata/telegraf) with the AWS native monitoring service, CloudWatch.
Disk read and write operations differences
AWS CloudWatch sees only 20% of the disk read and write operations that Telegraf tracks.

It seems that CloudWatch aggregates operations in some way.

This could be a drawback for workloads where disk speed is the bottleneck.
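
One thing worth ruling out when comparing the two numbers is a unit mismatch: the CloudWatch agent config in this post sets report_deltas = true for diskio, while the Telegraf diskio fields are cumulative counters. The Python sketch below shows how counter readings can be turned into per-interval deltas so that both sides are compared on the same basis. The sample values are made up for illustration, and this is not offered as the explanation for the 20% gap.

```python
def counter_deltas(samples):
    """Turn cumulative counter readings into per-interval deltas."""
    deltas = []
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            deltas.append(current - previous)
        else:
            # The counter went backwards (for example after a reboot),
            # so assume it restarted from zero.
            deltas.append(current)
    return deltas


# Hypothetical cumulative "reads" values, one per 10s collection interval.
telegraf_reads = [1000, 1250, 1600, 1600, 2100]
print(counter_deltas(telegraf_reads))  # [250, 350, 0, 500]
```

Summing the deltas over the same time window on both sides gives totals that can be compared directly.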

Similar code inside both agents
CloudWatch launch log
2020/01/14 15:20:45 I! I! Detected the instance is EC2
2020/01/14 15:20:45 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
2020/01/14 15:20:45 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_config.json ...
Valid Json input schema.
I! Detecting runasuser...
No csm configuration found.
No log configuration found.
Configuration validation first phase succeeded
2020/01/14 15:20:45 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2020/01/14 15:20:45 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
2020/01/14 15:20:45 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_config.json ...
2020/01/14 15:20:45 I! Detected runAsUser: ubuntu
2020/01/14 15:20:45 I! Change ownership to ubuntu:ubuntu
2020/01/14 15:20:45 I! Set HOME: /home/ubuntu
2020-01-14T15:20:45Z I! cloudwatch: get unique roll up list []
2020-01-14T15:20:45Z I! Starting AmazonCloudWatchAgent (version 1.232905.0)
2020-01-14T15:20:45Z I! Loaded outputs: cloudwatch
2020-01-14T15:20:45Z I! Loaded inputs: disk diskio mem netstat swap cpu
2020-01-14T15:20:45Z I! Tags enabled: host=ip-172-31-18-92
2020-01-14T15:20:45Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"ip-172-31-18-92", Flush Interval:1s
2020-01-14T15:20:45Z I! cloudwatch: publish with ForceFlushInterval: 1m0s, Publish Jitter: 37s
Telegraf log
Jan 13 23:34:50 roshop2 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Jan 13 23:34:50 roshop2 telegraf[31579]: 2020-01-13T22:34:50Z I! Starting Telegraf 1.13.1
Jan 13 23:34:50 roshop2 telegraf[31579]: 2020-01-13T22:34:50Z I! Loaded inputs: nginx rabbitmq disk diskio kernel system memcached net cpu mem processes swap logparser
Jan 13 23:34:50 roshop2 telegraf[31579]: 2020-01-13T22:34:50Z I! Loaded aggregators:
Jan 13 23:34:50 roshop2 telegraf[31579]: 2020-01-13T22:34:50Z I! Loaded processors:
Jan 13 23:34:50 roshop2 telegraf[31579]: 2020-01-13T22:34:50Z I! Loaded outputs: influxdb
Jan 13 23:34:50 roshop2 telegraf[31579]: 2020-01-13T22:34:50Z I! Tags enabled: host=roshop2 project=RoShop
Jan 13 23:34:50 roshop2 telegraf[31579]: 2020-01-13T22:34:50Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"roshop2", Flush Interval:10s
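Both agents print their collection and flush intervals in an almost identical "Config:" line at startup. A minimal Python sketch for pulling those two values out of such a line is shown here; the regular expression is an assumption based only on the two log lines above and may not match other agent versions.

```python
import re

# Matches the "Config: Interval:..., ..., Flush Interval:..." part of the line.
AGENT_CONFIG = re.compile(
    r"Config: Interval:(?P<interval>\w+), .*Flush Interval:(?P<flush>\w+)"
)


def agent_intervals(line):
    """Return (collection interval, flush interval) or None if not found."""
    match = AGENT_CONFIG.search(line)
    if match is None:
        return None
    return match.group("interval"), match.group("flush")


cloudwatch_line = (
    '2020-01-14T15:20:45Z I! Agent Config: Interval:10s, Quiet:false, '
    'Hostname:"ip-172-31-18-92", Flush Interval:1s'
)
telegraf_line = (
    '2020-01-13T22:34:50Z I! [agent] Config: Interval:10s, Quiet:false, '
    'Hostname:"roshop2", Flush Interval:10s'
)

print(agent_intervals(cloudwatch_line))  # ('10s', '1s')
print(agent_intervals(telegraf_line))    # ('10s', '10s')
```

In these two logs the collection interval is the same (10s) and only the flush interval differs.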
CloudWatch config file
[agent]
collection_jitter = "0s"
debug = false
flush_interval = "1s"
flush_jitter = "0s"
hostname = ""
interval = "10s"
logfile = "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log"
metric_batch_size = 1000
metric_buffer_limit = 10000
omit_hostname = false
precision = ""
quiet = false
round_interval = false
[inputs]
[[inputs.cpu]]
fieldpass = ["usage_idle", "usage_iowait", "usage_user", "usage_system"]
interval = "10s"
percpu = true
totalcpu = false
[inputs.cpu.tags]
"aws:StorageResolution" = "true"
metricPath = "metrics"
[[inputs.disk]]
drop_device = false
fieldpass = ["used_percent", "inodes_free", "free"]
interval = "10s"
[inputs.disk.tags]
"aws:StorageResolution" = "true"
metricPath = "metrics"
[[inputs.diskio]]
fieldpass = ["io_time", "write_bytes", "read_bytes", "writes", "reads"]
interval = "10s"
report_deltas = true
[inputs.diskio.tags]
"aws:StorageResolution" = "true"
metricPath = "metrics"
[[inputs.mem]]
fieldpass = ["used_percent", "free"]
interval = "10s"
[inputs.mem.tags]
"aws:StorageResolution" = "true"
metricPath = "metrics"
[[inputs.netstat]]
fieldpass = ["tcp_established", "tcp_time_wait"]
interval = "10s"
[inputs.netstat.tags]
"aws:StorageResolution" = "true"
metricPath = "metrics"
[[inputs.swap]]
fieldpass = ["used_percent"]
interval = "10s"
[inputs.swap.tags]
"aws:StorageResolution" = "true"
metricPath = "metrics"
[outputs]
[[outputs.cloudwatch]]
force_flush_interval = "60s"
namespace = "CWAgent"
region = "us-east-2"
tagexclude = ["host", "metricPath"]
[outputs.cloudwatch.tagpass]
metricPath = ["metrics"]
[processors]
[[processors.ec2tagger]]
ec2_instance_tag_keys = ["aws:autoscaling:groupName"]
ec2_metadata_tags = ["ImageId", "InstanceId", "InstanceType"]
refresh_interval_seconds = "2147483647s"
[processors.ec2tagger.tagpass]
metricPath = ["metrics"]
Telegraf metrics input
CPU metrics
  • time_user (float)
  • time_system (float)
  • time_idle (float)
  • time_active (float)
  • time_nice (float)
  • time_iowait (float)
  • time_irq (float)
  • time_softirq (float)
  • time_steal (float)
  • time_guest (float)
  • time_guest_nice (float)
  • usage_user (float, percent)
  • usage_system (float, percent)
  • usage_idle (float, percent)
  • usage_active (float)
  • usage_nice (float, percent)
  • usage_iowait (float, percent)
  • usage_irq (float, percent)
  • usage_softirq (float, percent)
  • usage_steal (float, percent)
  • usage_guest (float, percent)
  • usage_guest_nice (float, percent)
https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu
Disk io
  • reads (integer, counter)
  • writes (integer, counter)
  • read_bytes (integer, counter, bytes)
  • write_bytes (integer, counter, bytes)
  • read_time (integer, counter, milliseconds)
  • write_time (integer, counter, milliseconds)
  • io_time (integer, counter, milliseconds)
  • weighted_io_time (integer, counter, milliseconds)
  • iops_in_progress (integer, gauge)
https://github.com/influxdata/telegraf/tree/master/plugins/inputs/diskio
Memory
  • memory_data (int)
  • memory_locked (int)
  • memory_rss (int)
  • memory_stack (int)
  • memory_swap (int)
  • memory_usage (float)
  • memory_vms (int)
https://github.com/influxdata/telegraf/blob/cbe7d33bd4d8b243975354668c1393609d62112a/plugins/inputs/mem/README.md
AWS CloudWatch agent metrics input
AWS CPU metrics
cpu_time_active
cpu_time_guest
cpu_time_guest_nice
cpu_time_idle
cpu_time_iowait
cpu_time_irq
cpu_time_nice
cpu_time_softirq
cpu_time_steal
cpu_time_system
cpu_time_user
cpu_usage_active
cpu_usage_guest
cpu_usage_guest_nice
cpu_usage_idle
cpu_usage_iowait
cpu_usage_irq
cpu_usage_nice
cpu_usage_softirq
cpu_usage_steal
cpu_usage_system
cpu_usage_user

AWS disk metrics
disk_inodes_free
disk_inodes_total
disk_inodes_used
disk_total
disk_used
disk_used_percent
diskio_iops_in_progress
diskio_io_time
diskio_reads
diskio_read_bytes
diskio_read_time
diskio_writes
diskio_write_bytes
diskio_write_time
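
The two diskio lists can be compared mechanically. The Python sketch below assumes that a CloudWatch agent metric name is the Telegraf plugin name joined to the field name with an underscore, a convention inferred only from the lists in this post. The helper names are hypothetical, and the result reflects the lists as reproduced here rather than everything either agent can collect.

```python
# Field names copied from the Telegraf "Disk io" list in this post.
TELEGRAF_DISKIO = [
    "reads", "writes", "read_bytes", "write_bytes", "read_time",
    "write_time", "io_time", "weighted_io_time", "iops_in_progress",
]

# Metric names copied from the "AWS disk metrics" list in this post.
CLOUDWATCH_DISKIO = [
    "diskio_iops_in_progress", "diskio_io_time", "diskio_reads",
    "diskio_read_bytes", "diskio_read_time", "diskio_writes",
    "diskio_write_bytes", "diskio_write_time",
]


def cloudwatch_name(plugin, field):
    """Assumed naming convention: <plugin>_<field>."""
    return f"{plugin}_{field}"


def missing_in_cloudwatch(plugin, telegraf_fields, cloudwatch_metrics):
    """List Telegraf fields with no same-named CloudWatch metric."""
    available = set(cloudwatch_metrics)
    return [
        field for field in telegraf_fields
        if cloudwatch_name(plugin, field) not in available
    ]


print(missing_in_cloudwatch("diskio", TELEGRAF_DISKIO, CLOUDWATCH_DISKIO))
# ['weighted_io_time']
```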

AWS network, process and swap metrics
net_bytes_recv
net_bytes_sent
net_drop_in
net_drop_out
net_err_in
net_err_out
net_packets_sent
net_packets_recv
netstat_tcp_close
netstat_tcp_close_wait
netstat_tcp_closing
netstat_tcp_established
netstat_tcp_fin_wait1
netstat_tcp_fin_wait2
netstat_tcp_last_ack
netstat_tcp_listen
netstat_tcp_none
netstat_tcp_syn_sent
netstat_tcp_syn_recv
netstat_tcp_time_wait
netstat_udp_socket
processes_blocked
processes_dead
processes_idle
processes_paging
processes_running
processes_sleeping
processes_stopped
processes_total
processes_total_threads
processes_wait
processes_zombies
swap_free
swap_used
swap_used_percent