Running scheduled tasks on AWS Elastic Beanstalk

in Technology on October 10, 2015

If you use Elastic Beanstalk to run and manage your web apps, at some point you’ll want to set up some scheduled tasks, or cron jobs. Today’s blog post aims to take you through the best way of achieving this, whether you’re running in single-instance mode or load balancing.

Scheduled tasks are typically run overnight when load on your application is low, and are used for all sorts of things: everything from system maintenance (like database tidy-ups) to more compute-intensive jobs that may take several minutes to complete.

At Rotaready, we have many scheduled tasks that run at various times of the day and night. If you’re a recipient of our weekly rota digest SMS (the one that tells you when you’re working each week) then you’ve been contacted by one of our scheduled tasks!

Without further ado, I’m going to dive into the technical details and take you through a couple of ways to do this, depending on how your Elastic Beanstalk environment is configured.

Scheduling tasks in a Single-instance Environment

In this mode, a single EC2 instance is provisioned for you and there’s no load balancing, regardless of how busy your app gets. It’s cheap and ideal for a development environment or for production apps that get low traffic. It’s likely you’ll start out like this and look towards a load-balancing, autoscaling environment when your app gets popular and you need to scale.

Seeing as there will only ever be one running instance of your app in this environment, setting up scheduled tasks is quite straightforward. In fact, there are two ways to do it:

With .ebextensions
  1. If it doesn’t already exist, create a folder at the root of your application called .ebextensions
  2. Create a config file inside that folder, here we’ll call it cron.config
  3. Add the following text to the file.
    Note that this is a YAML configuration file. You must preserve the spaces at the beginning of each line or it won’t parse. Tabs aren’t allowed either.

    container_commands:
      01_remove_crontab:
        command: "crontab -r || exit 0"
      02_add_crontab:
        command: "cat .ebextensions/crontab | crontab"
    
  4. Then create a second file in your .ebextensions folder, this time called crontab
  5. Add the following text to the file:
    00 00 * * * /usr/local/bin/node /home/example/script.js
    

So what’s happening in step 3? As described here, commands are processed in alphabetical order, so we’re prefixing our command names with 01 and 02 to ensure they execute in this order. The first command wipes any pre-existing crontab, and the second adds our crontab file to the machine.

And what about step 5? In this example, I’m using Node.js to execute a JavaScript file every night at midnight. This article clearly describes what the 0’s and *’s do and how to structure a command. Customise to your liking!
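
To make the schedule part of that crontab line a little more concrete, here’s a minimal sketch that labels the five fields of a standard cron expression. Note that describeCron is a hypothetical helper I’ve made up purely for illustration; it isn’t part of cron or any library.

```javascript
// Label the five fields of a standard crontab schedule expression.
// describeCron is a hypothetical helper, for illustration only.
function describeCron(expression) {
    var fields = expression.trim().split(/\s+/);

    if (fields.length !== 5) {
        throw new Error('Expected 5 fields, got ' + fields.length);
    }

    return {
        minute: fields[0],     // 0-59
        hour: fields[1],       // 0-23
        dayOfMonth: fields[2], // 1-31
        month: fields[3],      // 1-12
        dayOfWeek: fields[4]   // 0-7 (0 and 7 both mean Sunday)
    };
}

// The schedule from step 5: minute 00 of hour 00, every day
console.log(describeCron('00 00 * * *'));
```

Everything after the fifth field in the crontab line is the command to run.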

Note! You must have a blank line at the end of your crontab file or it won’t run.

With a package (language-dependent)

If you’re using Node.js, try the brilliant node-schedule package. It allows you to do cron-style scheduled tasks in code, and there’s no need to mess about with any config like the previous example.

var schedule = require('node-schedule');

schedule.scheduleJob('0 0 * * *', function () {
    // Do stuff here
});

Scheduling tasks in a Load-balancing, Autoscaling Environment

This is where things get interesting. If you converted your single-instance environment to one that auto scales, Elastic Beanstalk will automatically spin up additional instances of your app in response to higher demand. This is good stuff and exactly what we want, but what effect will this have on our scheduled tasks?

Let’s say you have a task scheduled to run at midnight every night. Imagine that it’s nearly midnight right now and there’s sufficient demand on your app for there to be 3 running instances. Your task will be run three times, once by each instance! This isn’t ideal.

So how do we fix it?

If you use the .ebextensions example for single-instance environments (detailed above), there’s an extra property that can be added to the config file called leader_only. It has been suggested that adding this property and setting it to true will ensure only the designated ‘leader instance’ (in your auto scaling group) will run the commands, and therefore your scheduled tasks will only be run once. Even the official docs allude to this.

It turns out, however, that this doesn’t work as expected. The leader instance can be terminated, leaving you with nobody to run your tasks. An instance is nominated as the leader at deployment, and this is the shortcoming. This sounds silly, but let me explain:

Imagine you had one running instance (this is designated as the leader). Suddenly, load on your app spikes, Elastic Beanstalk spins up a second instance (not the leader), sweet. Only the leader will run your tasks. But now imagine load drops. Elastic Beanstalk realises you don’t need two running instances and terminates one. There’s a chance it’ll choose the first running instance to terminate (in fact I think it favours this choice), and not the second. And if that happens, you’re left without a leader.
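
That scenario can be sketched as a tiny simulation. The instance IDs are made up, and the choice of which instance gets terminated is passed in as a flag, since I’m only assuming how Elastic Beanstalk picks one.

```javascript
// Simulate a leader chosen once at deployment, followed by a
// scale-out and a scale-in. Instance IDs are made up, and which
// instance gets terminated is an assumption passed in as a flag.
function countTaskRunners(terminateOldest) {
    var instances = ['i-aaa'];  // one instance at deployment
    var leader = instances[0];  // leader is fixed at this point

    instances.push('i-bbb');    // load spikes: a second instance starts

    // Load drops: one of the two instances is terminated
    if (terminateOldest) {
        instances.shift();      // the original (leader) instance goes
    } else {
        instances.pop();        // the newer instance goes
    }

    // How many running instances will still run the scheduled tasks?
    return instances.filter(function (id) {
        return id === leader;
    }).length;
}

console.log(countTaskRunners(false)); // 1
console.log(countTaskRunners(true));  // 0
```

When the original instance is the one that goes, nothing is left to run the tasks until the next deployment nominates a new leader.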

So how do we fix that?

This is turning into a bit of a headache. But fear not, I have a solution. I ignore the .ebextensions config-style approach entirely, and deal with things in code instead:

var logger = require('../log'),
    async = require('async'),
    http = require('http'),
    AWS = require('aws-sdk');

AWS.config.update({region: 'eu-west-1'}); // change to your region
var elasticbeanstalk = new AWS.ElasticBeanstalk();

function runTaskOnMaster(name, taskToRun, callback) {
    logger.info('Beginning task: ' + name);

    async.waterfall([
        function (callback) {
            var options = {
                host: '169.254.169.254',
                port: 80,
                path: '/latest/meta-data/instance-id',
                method: 'GET'
            };

            var request = http.request(options, function (response) {
                response.setEncoding('utf8');
                var str = '';

                response.on('data', function (chunk) {
                    str += chunk;
                });

                response.on('end', function () {
                    callback(null, str);
                });
            });

            request.on('socket', function (socket) {
                socket.setTimeout(5000);
                socket.on('timeout', function() {
                    request.abort();
                });
            });

            request.on('error', function (e) {
                callback(e);
            });

            request.end();
        },

        function (currentInstanceId, callback) {
            var params = {
                // Note! You'll need to set this env variable in AWS to the name of your environment
                EnvironmentName: process.env.AWS_ENV_NAME
            };

            elasticbeanstalk.describeEnvironmentResources(params, function (err, data) {
                if (err) return callback(err);

                if (currentInstanceId != data.EnvironmentResources.Instances[0].Id) {
                    // Return here so the callback isn't invoked a second time below
                    return callback(null, false);
                }

                callback(null, true);
            });
        },

        function (isMaster, callback) {
            if (!isMaster) {
                logger.warn('Not running task as not master EB instance.');
                callback();
            } else {
                logger.info('Identified as master EB instance. Running task.');
                taskToRun(callback);
            }
        }
    ], function (err) {
        if (err) {
            logger.error('Error occurred during task.', err);
        } else {
            logger.info('Successfully finished task: ' + name);
        }

        callback();
    });
}

This example is in Node.js but you could rewrite it in almost any language, as it uses the AWS SDK (which is available in most of the popular languages/platforms). I call the runTaskOnMaster() function on all my instances using node-schedule, but you could easily call it from crontab instead.

So how does it work?

There’s a handy little Instance Metadata web service that runs on all EC2 instances. It’s available via an IP address (169.254.169.254) or a hostname (instance-data). I use it to get the Instance ID of the machine.

I then use the AWS SDK to ‘describe environment resources’ for a given Elastic Beanstalk environment (in my example I pass the environment name in as an environment variable, but you could hard code it). Amongst other things, this returns a list of all the running instances in the environment. We know there will always be at least one instance in this list, so I simply check whether the instance that’s running this code has the same ID as the first instance in the list. If it does, I deem that to be the master/leader, and we run the scheduled tasks. Boom!

Note! You’ll need to grant the IAM role (that your instances run under) some extra permissions for this to work. The actions to add to a new/existing policy are:

  • elasticbeanstalk:DescribeEnvironmentResources
  • autoscaling:DescribeAutoScalingGroups
  • autoscaling:DescribeAutoScalingInstances
  • cloudformation:ListStackResources (…possibly optional!)
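
As a rough sketch, here are those actions as an IAM policy document, built as a JavaScript object and printed as JSON. The Resource value of '*' is my assumption for brevity rather than a recommendation; narrow it if you can.

```javascript
// A sketch of an IAM policy document containing the actions listed above.
// Resource: '*' is an assumption made for brevity; narrow it where possible.
var policy = {
    Version: '2012-10-17',
    Statement: [{
        Effect: 'Allow',
        Action: [
            'elasticbeanstalk:DescribeEnvironmentResources',
            'autoscaling:DescribeAutoScalingGroups',
            'autoscaling:DescribeAutoScalingInstances',
            'cloudformation:ListStackResources' // possibly optional
        ],
        Resource: '*'
    }]
};

// Print the policy as JSON, ready to paste into the IAM policy editor
console.log(JSON.stringify(policy, null, 2));
```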

In summary

I’ve found this approach works great; only one instance ever runs my scheduled tasks, regardless of how many times my app has been scaled up and down and regardless of which instances were terminated.

As an improvement, we could sort the Instances list returned from the SDK. While I’ve always found it to have consistent ordering, this would be a sensible thing to do.
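
Here’s a minimal sketch of that sorting idea. The isMasterInstance helper is hypothetical (it isn’t in the code above), and it assumes the instances argument has the same shape as data.EnvironmentResources.Instances, i.e. an array of objects with an Id property.

```javascript
// Decide whether this instance is the master by sorting the IDs first,
// so the result doesn't depend on the order the SDK returns them in.
// isMasterInstance is a hypothetical helper; `instances` is assumed to
// look like data.EnvironmentResources.Instances: [{ Id: 'i-...' }, ...]
function isMasterInstance(currentInstanceId, instances) {
    if (!instances || instances.length === 0) {
        return false; // nothing to compare against
    }

    var ids = instances.map(function (instance) {
        return instance.Id;
    }).sort(); // default sort is lexicographic, which is fine for IDs

    return ids[0] === currentInstanceId;
}

var running = [{ Id: 'i-0bbb' }, { Id: 'i-0aaa' }]; // made-up IDs
console.log(isMasterInstance('i-0aaa', running)); // true
console.log(isMasterInstance('i-0bbb', running)); // false
```

Which instance ends up first after sorting is arbitrary, but every instance will agree on it, and that’s all that matters here.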

Do let me know your thoughts, whether you’ve found any faults in this method or if you’ve found an even easier way to do it!

14 thoughts on “Running scheduled tasks on AWS Elastic Beanstalk”

  1. This is an interesting post. Is the reason for this approach that you don’t want to set up a worker environment with Elastic Beanstalk?

    1. Hey Nam, thanks for the comment.

      Yes, to some extent! Having your cron jobs run in a dedicated Worker Environment is a great approach – one I already use for CPU intensive processing.

      However, I have a few little jobs that utilise lots of common functionality in my API’s codebase. To have them run in a worker environment I suppose I’d have to copy a lot of that code into a separate codebase, or into some sort of common library. I’ll probably end up doing that, but having them run within the API itself (but only on one instance) felt like a reasonable compromise for the time being.

  2. Carl, very useful for me. I like the method you used.
    A couple updates to the code of yours:
    + You have to get the elasticbeanstalk object by calling: var elasticbeanstalk = new AWS.ElasticBeanstalk(); before you do .describeEnvironmentResources(…)

    + Have to set the region for the SDK through: AWS.config.update({region: 'us-east-1'}); // us-east-1 for example

    + And you need another policy added to the IAM: cloudformation:ListStackResources

    Thanks!

    1. Hey Mohammed, glad you found the post useful!

      Many thanks for bringing those mistakes to my attention – I’ve updated the article.
      I included the extra IAM policy but I added a little note to say it’s optional as I don’t think I needed it – guessing it depends on how your AWS environment is configured…

  3. Carl,

    Have you tried the cron.yaml that can be used on EB ? Here is the quote from AWS docs on leader election.

    “Elastic Beanstalk uses leader election to determine which instance in your worker environment queues the periodic task. Each instance attempts to become leader by writing to a DynamoDB table. The first instance that succeeds is the leader, and must continue to write to the table to maintain leader status. If the leader goes out of service, another instance quickly takes its place.”

    http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html#worker-periodictasks

    Is this mechanism of leader election flawed inside AWS ? I am contemplating using it and wanted to know your experience on it.

    1. Hey Kris, cheers for the comment!

      I think we’re crossing wires slightly. This blog post is about how to run scheduled tasks on one instance of an auto-scaling Web Server Environment. What you’re referring to is a Worker Environment.

      And if you’re running an auto-scaling worker environment, what you’ve linked to looks perfect for the job! I haven’t tried it myself but after reading through the docs it appears they’ve put together a great solution there. I’d definitely go down that route – running your scheduled tasks in a dedicated worker environment is much better than running them on your web server (which may become unresponsive to HTTP requests if you have CPU-heavy or long running tasks).

      However, if you have just a few periodic tasks that only take a few seconds to execute, don’t want the added cost of running a worker environment 24/7, and want to utilise the same codebase your web server runs off, then my solution is a pretty good fit. Time-triggered AWS Lambda functions could also do a great job at that too.

      1. Hey ippei,

        Great spot! AWS_ENV_NAME isn’t standard, I’ll update the article to mention that. I like your solution a lot, especially the use of ES6. We’re ES6-ifying lots of our code as we go along but it’s a big task.

        I hope others find your solution useful too.

  4. It seems scaling up using the .ebextensions solution is solved by adding the leader option. Scaling down is the issue as it could delete your leader server. Would enabling termination protection on this first server prevent this from happening?

  5. Hi everyone, have you ever tried the Lambda functions + CloudWatch approach? You could set up a schedule rule in CloudWatch and then target it at any Lambda function you want.

  6. Great Article Carl.

    I am facing the same problem: the leader server gets terminated and the cron jobs stop running. Is there any way to fix this issue without writing any code or updating the IAM policy?

    Thanks for your info.

  7. How do you upload the script.js file to /home/example/script.js or can we specify the path relative to our app’s root folder like ./script.js ?
