Last week, Amazon Web Services announced their latest hardware device aimed at Machine Learning education: the DeepRacer, an autonomous model car, with an accompanying simulator environment and Reinforcement Learning toolkit.
Also launched at re:Invent was the official AWS DeepRacer League, a racing league for developers to compete with their DeepRacer models for the DeepRacer cup. The league takes place throughout 2019, but for 2018 the league took place over 48 hours during the conference.
While the racing was taking place at the MGM Grand Garden Arena, racers’ times were recorded, tallied, and displayed throughout the conference, with the winners competing in the Cup Final before Werner’s keynote.
AWS tasked us with building a leaderboard for recording all the race times set by competitors through the conference and displaying the leading entries in real-time on screens all through the venue, along with a mobile interface on which competitors could see the leaderboard and search for their own time.
We took all our prior knowledge of running real-time systems in the cloud and decided to build the whole system for the web, making use of AWS systems and services wherever possible.
Here’s how we did it.
The core of the system is very simple: the main data-store is an Aurora MySQL database, containing records of competitors and their racing times, which is updated via a Lambda function invoked through API Gateway. All the leaderboard screens throughout the venue connect to AWS IoT Core via MQTT-over-WebSockets and listen for updates on the “leaderboard” Thing. When a new time is recorded, the Lambda queries the latest set of top times and publishes it to the Thing Shadow, which is pushed out to all connected clients in real-time. Then there’s another Lambda function (again behind API Gateway) which can be called by the mobile UI for users searching for their own entries.
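As a rough illustration of the "query the top times, then publish to the Thing Shadow" step, here is a minimal sketch of how the shadow update document could be assembled. The entry shape (name, timeMs), the use of state.desired, and the function names are assumptions made for this example, not details taken from the real backend.

```javascript
// Topic that accepts shadow updates for a named Thing in AWS IoT.
function shadowUpdateTopic(thingName) {
  return `$aws/things/${thingName}/shadow/update`;
}

// Hypothetical entry shape: { name: string, timeMs: number }.
// Returns the JSON document that would be published to the Thing Shadow.
function buildShadowUpdate(entries, limit = 20) {
  const top = [...entries] // copy so the caller's array is not reordered
    .sort((a, b) => a.timeMs - b.timeMs) // fastest time first
    .slice(0, limit)
    .map((entry, index) => ({
      rank: index + 1,
      name: entry.name,
      timeMs: entry.timeMs,
    }));
  // Shadow documents wrap application data in a top-level "state" object.
  // Whether the real system used "desired" or "reported" is an assumption.
  return { state: { desired: { entries: top } } };
}

const exampleUpdate = buildShadowUpdate(
  [
    { name: "racer-a", timeMs: 14250 },
    { name: "racer-b", timeMs: 12980 },
  ],
  20
);
console.log(shadowUpdateTopic("leaderboard"), JSON.stringify(exampleUpdate));
```

Publishing the whole top list each time, rather than a diff, keeps the clients simple: every message they receive is a complete picture of the board.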
The front-end of the Leaderboard is a simple React application – when a new set of state data is pushed to the Thing Shadow, it’s received over the WebSocket and the React tree is re-rendered.
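On the client side, the handling of an incoming shadow message might look something like the sketch below. It assumes the client subscribes to the shadow's update/accepted topic and that the entries live under state.desired.entries; both are illustrative choices rather than confirmed details of our implementation. The MQTT connection and the React component are deliberately left out.

```javascript
// Topic on which AWS IoT announces accepted shadow updates for a Thing.
function acceptedTopic(thingName) {
  return `$aws/things/${thingName}/shadow/update/accepted`;
}

// Returns the new entries array, or null if the message is not relevant.
// `payload` may be a string or a Buffer, as delivered by typical MQTT clients.
function entriesFromShadowMessage(thingName, topic, payload) {
  if (topic !== acceptedTopic(thingName)) return null;
  let doc;
  try {
    doc = JSON.parse(payload.toString());
  } catch (err) {
    return null; // ignore malformed payloads rather than crash the UI
  }
  const entries =
    doc && doc.state && doc.state.desired && doc.state.desired.entries;
  return Array.isArray(entries) ? entries : null;
}
```

In a React component, a non-null result would simply be passed to the state setter, which triggers the re-render described above.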
When building the leaderboard it was important to make sure that whatever we built would be able to meet demand. Predicting that demand was somewhat tricky – we knew that there would be a very small number of people using the timekeeping interface to submit new entries; we knew roughly how many times the main leaderboard UI would be shown on screens at the conference, but we had very little idea how many people would be using the mobile interface. All we knew was that at some point during the conference, the main re:Invent app would be updated to show a prominent link to all attendees.
That meant we had to be able to scale anywhere up to about 50,000 people using the mobile interface at any one moment. In reality it didn’t reach quite those numbers, but we knew we needed to be able to handle a large spike in traffic at very little notice.
In the true spirit of the AWS cloud, we used as many AWS services as possible so that we wouldn’t have to worry about scaling in most places. API Gateway, IoT and Lambda scale to very high loads with no special intervention or auto-scaling configuration on our part. The only part of our system that we would need to scale ourselves was the database, living on Aurora.
For various reasons we weren’t able to use Aurora Serverless (see below), but since we only needed to scale reads from the database, it was a simple matter of making sure we had a strategy for scaling Read Nodes in the Aurora cluster should the need arise. We could have added an Auto Scaling config to add Read Nodes automatically, but for simplicity we decided to keep the process manual – we were on-site and monitoring graphs all the time anyway, and if load was looking too high we could easily throw more Read Nodes into the cluster until it was happy again.
As it happens, we needn’t have worried about scaling the database – even on a tiny RDS instance, CPU load barely registered above 10% even at peak load. And all the other parts of the infrastructure scaled completely invisibly without our intervention.
Lambda, Lambda everywhere
The leaderboards themselves fetch all their data from the leaderboard Thing via AWS IoT. For all the other data fetching (e.g. in the mobile app where you can see more than the top 20, or filter results by a search query), submission of new entries, and general administration, the manipulation is done via Lambda functions invoked via API Gateway.
There are three Lambda functions in the backend:
The “fetch entries” function
This function is called by the mobile UI, which needs to send a query to filter the returned results. The function queries the MySQL database to fetch the requested results.
The “submit new entry” function
This function is called by the timekeeping UI when submitting a new entry time. It inserts the details for the competitor as well as their race time, then fetches the current state of the top 20 entries, and publishes that update to the Leaderboard Thing Shadow.
The “admin” function
This function serves several different API endpoints used by the AWS team in charge of the league. It can edit the time for an entry already submitted, remove an entry that should be disqualified, and fetch more details on the competitors such as contact details.
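To make the “fetch entries” behaviour a little more concrete, here is a sketch of how a search term could be turned into a parameterised query. The table name (entries), the column names (name, time_ms), and the helper names are all hypothetical; the "?" placeholders follow the style used by common Node MySQL clients.

```javascript
// Escape LIKE wildcards (and the escape character itself) so that
// user input is matched literally rather than as a pattern.
function escapeLike(term) {
  return term.replace(/[\\%_]/g, (ch) => "\\" + ch);
}

// Builds the SQL text and the parameter list separately, so the search
// term is never concatenated into the query string.
function buildFetchQuery(search, limit = 20, offset = 0) {
  const params = [];
  let sql = "SELECT name, time_ms FROM entries"; // hypothetical schema
  const term = (search || "").trim();
  if (term !== "") {
    sql += " WHERE name LIKE ?";
    params.push("%" + escapeLike(term) + "%");
  }
  sql += " ORDER BY time_ms ASC LIMIT ? OFFSET ?";
  params.push(limit, offset);
  return { sql, params };
}
```

Only public columns are selected here, which fits naturally with a database user that has read-only access to those columns alone.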
These three Lambda functions are all built using aws-serverless-express, a fantastic node module which makes it incredibly easy to build APIs on Lambda and API Gateway using the express framework, which in turn makes it easy to serve several different API endpoints from the same Lambda function without creating a function for each (e.g. the “admin” function serves /admin/recentEntries as well as its other admin endpoints).
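The idea of serving several endpoints from one Lambda function can be shown without any framework. The sketch below is not how aws-serverless-express works internally; it swaps in a plain dispatch table keyed on the method and path of an API Gateway proxy event, purely to illustrate the routing concept. The second route and both handler bodies are hypothetical placeholders.

```javascript
// Map of "METHOD path" to an async route function.
const routes = new Map([
  // Route name from the article; the body is a placeholder.
  ["GET /admin/recentEntries", async () => ({ entries: [] })],
  // Hypothetical second route, included only to show the pattern.
  ["GET /admin/hypotheticalRoute", async () => ({ ok: true })],
]);

// Lambda-style handler for an API Gateway proxy integration event.
async function handler(event) {
  const route = routes.get(`${event.httpMethod} ${event.path}`);
  if (!route) {
    return { statusCode: 404, body: JSON.stringify({ error: "Not found" }) };
  }
  const result = await route(event);
  return { statusCode: 200, body: JSON.stringify(result) };
}
```

With express, the same structure is expressed as router definitions, and the adapter module takes care of translating the Lambda event into a request and the response back again.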
Since all these Lambda functions serve different purposes and deal with different levels of data sensitivity, we needed to set up some kind of permission system to ensure people couldn’t mess around with it.
Security and authentication
A racing leaderboard wouldn’t be very good if anybody could just find the URL for the time-entry page and submit their own times to put themselves at the top of the rankings. To protect against this, we set up a Cognito User Pool and Identity Pool, which gives us a user/password system, and allows us to specify an IAM Role to assume when a user logs into the time-entry page. Then, we used IAM authentication and permissions on the “submit new entry” route in API Gateway to make sure it could only be called using that particular IAM role.
Since the race officials needed to be able to contact the winning entrants to recall them for the Cup Final, the time submission page also stores a phone number and email address for all competitors. This is sensitive personal information, so we needed to ensure that nobody had access to this except the AWS team in charge of the race – not even the officials operating the timekeeping system at the stadium. To ensure this, we added another IAM Role, which could only be assumed by the admin user in our Cognito User Pool. This role then has permission to invoke a separate admin endpoint in the API Gateway which returns this more sensitive data.
So, all told, we have three IAM roles – a default, unauthenticated role, which is used by the leaderboards themselves and the mobile app for reading the current state of entrants and subscribing to updates via IoT; a basic authenticated role used by the timekeepers to submit new entries; and an admin role which can access the most sensitive data, and also submit after-the-fact edits to already-submitted race times.
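The three roles can be pictured as three small IAM policy documents. The sketch below builds them as plain objects; the API routes (GET /entries, POST /entries, /admin/*), the parameter names, and the exact set of statements are assumptions for illustration, and the real policies may well have differed.

```javascript
// Builds illustrative IAM policy documents for the three roles.
function buildRolePolicies({ region, accountId, apiId, stage, thingName }) {
  const api = `arn:aws:execute-api:${region}:${accountId}:${apiId}/${stage}`;
  const iot = `arn:aws:iot:${region}:${accountId}`;
  const allow = (actions, resources) => ({
    Effect: "Allow",
    Action: actions,
    Resource: resources,
  });
  const policy = (...statements) => ({
    Version: "2012-10-17",
    Statement: statements,
  });

  return {
    // Leaderboard screens and mobile app: read entries, listen for updates.
    unauthenticated: policy(
      allow(["execute-api:Invoke"], [`${api}/GET/entries`]),
      allow(["iot:Connect"], [`${iot}:client/*`]),
      allow(["iot:Subscribe"], [`${iot}:topicfilter/$aws/things/${thingName}/shadow/*`]),
      allow(["iot:Receive"], [`${iot}:topic/$aws/things/${thingName}/shadow/*`])
    ),
    // Timekeepers: may additionally submit new entries.
    timekeeper: policy(allow(["execute-api:Invoke"], [`${api}/POST/entries`])),
    // Admins: may call every method on the admin endpoints.
    admin: policy(allow(["execute-api:Invoke"], [`${api}/*/admin/*`])),
  };
}
```

Because API Gateway checks execute-api:Invoke against the caller's role, a request signed with the unauthenticated role's credentials is rejected before the “submit new entry” Lambda is ever invoked.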
Each of the three Lambda functions detailed above also runs with a separate Execution Role and MySQL user, which dictates exactly what it’s allowed to access. This means that (in theory) if there were any sort of exploitable software vulnerability in our Lambda code, a bad actor wouldn’t be able to access any more data or perform any more actions than they already have access to. For example, the “fetch” Lambda cannot publish to IoT, and it uses MySQL credentials which give it read-only access to the data, limited to those columns which are available for public consumption.
Putting it all together
The whole Leaderboard backend is defined in a single CloudFormation template. This makes it super-easy for us to keep track of changes, deploy multiple versions (for dev and test), and to quickly deploy in new regions (we’re based in London so we did all of our original development in the eu-west-2 London region, but for production we spun up a new stack in us-west-2 Oregon to be geographically closer to the venue in Las Vegas). It also meant that whenever we ran into trouble or something we didn’t expect (more on that below), we didn’t have to remember all the steps we took to figure it out once we wanted to deploy new versions – it was all right there in the template code.
I won’t pretend that the development process was easy from beginning to end and went without a hitch, because that wouldn’t be true. There are plenty of quirks and things you might not expect when building things on AWS, especially when you’re trying to hook together several different services and moving parts. Here are just a few, and the steps we had to take to solve them:
- When connecting to the IoT Device Gateway, you use an account-specific endpoint, which looks something like a1234567890abc.iot.us-west-2.amazonaws.com. For compatibility and legacy reasons, this endpoint uses an SSL certificate that’s no longer trusted by most major browsers, so requests to it simply fail with a security error. There is a separate endpoint, formatted the same but with -ats inserted before .iot., which is signed by Amazon Trust Services, but this isn’t the default endpoint returned by IoT’s DescribeEndpoint API. You need to use the endpointType=iot:Data-ATS parameter to request this up-to-date endpoint, or remember to substitute in the -ats yourself, if you want anything to work in modern browsers.
- In order to lock down access to our Aurora database, we gave it a VPC security group and locked down port 3306. However, granting access only to our Lambda functions meant we had to run the functions inside our VPC, which gives them very long cold-start times. In addition, in order to access the IoT endpoint, those functions needed to have access to the public internet, which meant we had to add a NAT Gateway (and corresponding route tables) into our VPC.
- Since our setup scripts needed to be able to query the Aurora database and set up the schema with all the correct tables and columns, we needed the database to be accessible from the public internet. We could have spun up an EC2 instance or a Lambda function to run queries directly from within our VPC but we decided that added unnecessary complication for this project. What it did mean, however, was that we weren’t able to use Aurora Serverless, since it’s not possible to set up with public internet access. Today we would be able to solve this with the new Aurora Serverless Data API, but this wasn’t available when we started building the backend.
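For the first quirk in the list above, the endpoint substitution is easy to get subtly wrong by hand, so a small helper can be useful. This is a minimal sketch; the function name is our own invention, and it assumes the endpoint has the shape shown earlier (an account-specific prefix followed by .iot. and the region).

```javascript
// Converts a legacy IoT data endpoint into its ATS-signed equivalent by
// inserting "-ats" before ".iot.". Endpoints that already carry the
// suffix, or that do not match the expected shape, are returned unchanged.
function toAtsEndpoint(endpoint) {
  if (/-ats\.iot\./.test(endpoint)) return endpoint; // already ATS
  return endpoint.replace(/^([^.]+)\.iot\./, "$1-ats.iot.");
}

console.log(toAtsEndpoint("a1234567890abc.iot.us-west-2.amazonaws.com"));
// prints "a1234567890abc-ats.iot.us-west-2.amazonaws.com"
```

Where possible, requesting the endpoint from DescribeEndpoint with endpointType=iot:Data-ATS is the more robust option, since it avoids string manipulation altogether.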