Monitoring on AWS with CloudWatch Agent and Procstat

Objective: Install CloudWatch Agent with procstat on an EC2 instance and configure a metric alarm in CloudWatch

One of the first issues I ran into was with IAM Policies, or lack thereof . Specifically it was the managed policy CloudWatchAgentServerPolicy which needed to be added. The telltale sign that you forgot to add this Policy is an error message in the Agent logs, seen below

2020-08-17T22:46:18Z E! refresh EC2 Instance Tags failed: NoCredentialProviders: no valid providers in chain<br>caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.

The procstat plugin fortunately is already part of the Agent from install, but it still needs to be configured. In order to do this you have to add a configuration file specific to your monitoring needs. For old school admins the easiest way to think of procstat is that it basically ties into the ps tool. It’s like doing a `ps -ef | grep` to find something about a running process.

[root@lab-master amazon-cloudwatch-agent.d]# pwd
/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d
[root@lab-master amazon-cloudwatch-agent.d]# cat processes
{
    "metrics": {
        "metrics_collected": {
            "procstat": [
                {
                    "pattern": "nginx: master process /usr/sbin/nginx",
                    "measurement": [
                        "pid_count"
                    ]
                }
            ]
        }
    }
}

This will get us far enough that now we can see values in the Metrics view of CloudWatch. Once we have data there its time to construct a metric alarm. My goal was to use Terraform however its less painful to do in the AWS console.

resource "aws_cloudwatch_metric_alarm" "nginx-master" {
  alarm_name = "nginx master alarm"
  comparison_operator = "LessThanThreshold"
  evaluation_periods = 1
  datapoints_to_alarm = 1
  metric_name = "procstat_lookup_pid_count"
  namespace = "CWAgent"
  period = "300"
  statistic = "Average"
  threshold = "1"
  alarm_description = "Checks for the presence of an nginx-master process"
  alarm_actions = [aws_sns_topic.pagerduty_standard_alarms.arn]
  insufficient_data_actions = []
  treat_missing_data = "missing"
  dimensions = {
    "AutoScalingGroupName" = "some-ASG-YXI8VDT6MBE3"
    "ImageId"       = "some-ami"
    "InstanceId"    = "some-instance-id"
    "InstanceType"  = "t3a.large"
    "pattern"       = "nginx: master process /usr/sbin/nginx"
    "pid_finder"    = "native"
  }
}

The alarm creation proved to be a lot harder than I had expected, taking up several hours. I had to re-create things in my lab setup twice and do a Terraform import. The problem turned out to be that the dimensions{} block is not optional despite what the Terraform docs say. Had it said the fields were all required I probably would have saved days of time.

Polish Work

In the process of working things out I hard coded a lot of values in the Dimensions {} block. Naturally that is not good practice, especially with IaaS so I will need to rework it to use variables instead. Also the alarm names should utilize the Terraform workspace values for better naming.

Terraform – Reference parent resources

Sometimes things get complicated in Terraform, like when I touch it and make a proper mess of the code. Here is a fairly straight forward example of how to reference parent resources in a child.

├── Child
│   └── main.tf
└── main.tf

1 directory, 2 files
$ pwd
/Users/dword4/Terraform

First lets look at what should be in the top level main.tf file, the substance of which is not super important other than to have a rough idea of what you want/need

provider "aws" {
  region = "us-east-2"
  profile = "lab-profile"
}

terraform {
  backend "s3" {}
}

# lets create an ECS cluster

resource "aws_ecs_cluster" "goats" {
  name = "goat-herd"
}

output "ecs_cluster_id" {
  value = aws_ecs_cluster.goats.id
}

What this does simply is create an ECS cluster with the name “goat-herd” in us-east-2 and then outputs ecs_cluster_id which contains the ID of the cluster. While we don’t necessarily need the value outputted visually to us, we need the output because it makes the data available to other modules including child objects. Now lets take a look at what should be in Child/main.tf

provider "aws" {
  region = "us-east-2"
  profile = "lab-profile"
}

terraform {
  backend "s3" {}
}
module "res" {
  source = "../../Terraform"
}
output "our_cluster_id" {
  value = "${module.res.ecs_cluster_id}"
}

What is going on in this file is that it creates a module called res and sources it from the parent directory where the other main.tf file resides. This allows us to reference the module and the outputs it houses, enabling us to access the ecs_cluster_id value and use it within other resources as necessary.

Selecting an AWS subnet by name in Terraform

One of my recent challenges has been to write tf code to select existing subnets and use them in new blocks of code (specifically in this case to create a Directory, Workspaces and add a few Security Group entries). Since I am relatively new to using Terraform to do this it took far longer to figure out than I would care to say and I figured it would be best to document what finally worked and had the concept click for me in my mind.

provider "aws" {
  region = "us-east-1"
}
variable "subnet_name" {
  default = "workspaces-private-us-east-1c"
}
data "aws_subnet" "selected" {
  filter {
    name = "tag:Name"
    values = ["${var.subnet_name}"]
  }
}

output "vpcid" {
  value = "${data.aws_subnet.selected.vpc_id}"
}

output "subnet_name" {
  value = "${var.subnet_name}"
}
output "subnet_id" {
  value = "${data.aws_subnet.selected.id}"
}

This will look up the named subnet “workspaces-private-us-east-1c” and obtain not only the VPC ID associated with it but the unique subnet id as well, the output should look something like the below sample provided the name you are looking up is unique

data.aws_subnet.selected: Refreshing state...

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Outputs:

subnet_id = subnet-0299e079c90b20ea6
subnet_name = workspaces-private-us-east-1c
vpcid = vpc-04066bef0a56ebcc2

This is of course specific to things as of Terraform 0.12.20 and provider.aws 2.48.0 so naturally things may change over time, however this will get you close and provide you enough of a starting point to use these subnets in other things.

Close Bitnami banner
Bitnami