One Cloud Please

Bucketsquatting is (Finally) Dead

2026-03-13T00:00:00+00:00

For a decade, I have been working with AWS and third-party security teams to resolve bucketsquatting / bucketsniping issues in AWS S3. Finally, I am happy to say AWS now has a solution to the problem, and it changes the way you should name your buckets.

What is Bucketsquatting?

Bucketsquatting (or sometimes called bucketsniping) is an issue I first wrote about in 2019, and it has been a recurring issue in AWS S3 ever since. If you’re interested in the specifics of the problem, I recommend you check out my original post on the topic: S3 Bucket Namesquatting - Abusing predictable S3 bucket names. In short, the problem is that S3 bucket names are globally unique, and if the owner of a bucket deletes it, that name becomes available for anyone else to register. This can lead to a situation where an attacker can register a bucket with the same name as a previously deleted bucket and potentially gain access to sensitive data or disrupt services that rely on that bucket.

Additionally, it is a common practice for organizations to use predictable naming conventions for their buckets, such as appending the AWS region name to the end of the bucket name (e.g. myapp-us-east-1), which can make it easier for attackers to guess and register buckets that may have been previously used. This latter practice is one that AWS’ internal teams commonly fall victim to, and it is one that I have been working with the AWS Security Outreach team to address for almost a decade now across dozens of individual communications.

A new namespace

To address this issue, AWS has introduced a new protection that works effectively as a “namespace” for S3 buckets. The namespace syntax is as follows:

---an

For example, if your account ID is 123456789012, your prefix is myapp, and you want to create a bucket in the us-west-2 region, you would name your bucket as follows:

myapp-123456789012-us-west-2-an

Though not explicitly mentioned, the -an here refers to the “account namespace”. This new syntax ensures that only the account that owns the namespace can create buckets with that name, effectively preventing bucketsquatting attacks. If another account tries to create a bucket with the same name, they will receive an InvalidBucketNamespace error message indicating that the bucket name is already in use. Account owners will also receive an InvalidBucketNamespace error if they try to create a bucket where the bucket region does not match the region specified in the bucket name.

Interestingly, the guidance from AWS is that this namespace is recommended to be used by default. Namespaces aren’t new to S3, with suffixes like .mrap, --x-s3, and -s3alias all being examples of existing namespaces that AWS previously used for new features; however, this is the first time AWS has introduced a namespace that is recommended for general use by customers to protect against a specific security issue.

It is AWS’ stance that all buckets should use this namespace pattern, unless you have a compelling reason not to (hint: there aren’t many). To this end, AWS is allowing security administrators to set policies that require the use of this namespace through the use of a new condition key s3:x-amz-bucket-namespace, which can be applied within an Organization’s SCP policies to enforce the use of this protection across an organization.

This doesn’t retroactively protect any existing buckets (or published templates that use a region prefix/suffix pattern without the namespace), but it does provide a strong protection for new buckets going forward (okay, so it’s dying, not dead). If you wish to protect your existing buckets, you’ll need to create new buckets with the namespace pattern and migrate your data to those buckets.

What about the other cloud providers?

While AWS has introduced this new namespace protection for S3 buckets, the other major cloud providers handle things slightly differently.

Google Cloud Storage already has a namespace concept in place for its buckets, which is based on domain name verification. This means that only the owner of a domain can create buckets with names that are of a domain name format (e.g. myapp.com), and they must verify ownership of the domain before they can create buckets with that name. Bucketsquatting is still possible with non-domain name formatted buckets, but the use of domain name formatted buckets is Google’s solution to the issue.

For Azure Blob Storage, storage accounts are scoped with a configurable account name and container name, so the same issue does apply. This is further exacerbated by the fact that Azure’s storage account names have a maximum of 24 characters, leaving a fairly small namespace for organizations to work with. (h/t vhab for pointing this out)

tl;dr

There is a new namespace for S3 buckets. The namespace protects you from bucketsquatting attacks, and you should use it for any S3 buckets you create.

If you liked what I’ve written, or want to hear more on this topic, reach out to me on LinkedIn or 𝕏.

MistakenVMtity: Another cloud image confusion attack

2025-03-10T00:00:00+00:00

Last month, Seth Art from Datadog Security Labs published an excellent post on AWS cloud image confusion attacks. In this post, I’ll explain how Azure has a similar issue with its CLI.

If you haven’t seen the Datadog Security Labs post, I highly recommend you check it out. It’s a great read and provides a lot of context for the issue I’ll be discussing here. They do have the better title pun though.

Image confusion attacks

When provisioning virtual machines within the cloud, users typically specify an image to use as the base for the VM. This image is often referred to by a name or ID. In the case of AWS, the image is referred to as an Amazon Machine Image (AMI) and is identified by an AMI ID. In Azure, the image is referred to as a Virtual Machine Image and is identified by a URN which is comprised of a combination of a publisher name, an offer name, a SKU, and a version, all concatenated by a colon (e.g. Canonical:ubuntu-24_04-lts:server:24.04.202502210).

An image confusion attack occurs when an attacker is able to create an image with a name that matches the search or filter criteria that a user is using to select their intended image. This can lead to the attacker’s image being selected instead of the legitimate image. An attacker will generally create an image that acts just like the legitimate image, but with some additional functionality that can be used to compromise the user’s environment with remote code execution, data exfiltration, or other malicious activities. In the AWS example, this was done using the AWS CLI command aws ec2 describe-images and Terraform data providers which performed a search for images based on the name or partial name of the image, which could include the attacker’s image.

The GitHub example

In 2023, I was looking at how GitHub advised deploying its GitHub Enterprise Server offering on Azure. The documentation at the time advised using the Azure CLI to determine the latest version of the GitHub Enterprise Server image as follows:

$ az vm image list --all -f GitHub-Enterprise | grep '"urn":' | sort -V

This command would list all the images available in Azure that had an offer name of “GitHub-Enterprise” and then sort them by version number. The user could then select the latest version of the image to use for their deployment. Notably, the command did not filter by publisher name or SKU, only by offer name. This meant that an attacker could create an image with the offer name “GitHub-Enterprise” under their separate publisher identifier and have it appear in the list of images returned by the command. Publisher identifiers are unique in Azure, but not offers, SKUs or versions.

In Azure, to register an image which has a public URN, you list your offering on the Azure Marketplace via the Azure Partner Center. After some KYC checks, you can register any arbitrary publisher identifier. In my case, I registered “ghes” for GitHub Enterprise Server.

I then created an offer with the version number of “99.99.99” to ensure my image would appear as the latest image in the list.

I also selected the option to hide the plan from the Azure Marketplace UI, which would prevent users from more clearly identifying the difference.

This specific offer was not fully published to the Azure Marketplace to avoid direct customer impact to GitHub customers and was instead reported to GitHub. Though GitHub stated that these findings “do not present a significant security risk”, they have since updated their documentation to use a specific filter for the GitHub Enterprise Server image, as follows:

az vm image list --all -f GitHub-Enterprise | grep '"urn": "GitHub:' | sort -V

This change specifically filters the images by the publisher name “GitHub” and the offer name “GitHub-Enterprise”. If you are a provider looking to avoid this issue, I would recommend you follow this pattern in your documentation, or alternatively provide a full list of URNs for your users to select from.

An extra step needed

In my testing of Marketplace publication, I found that when executing a deployment of my free marketplace VM image using az vm create, Azure would initially reject my request to deploy the image. This was because the terms of the Marketplace image was not yet “accepted”.

The user would be required to execute az vm image accept-terms or az vm image terms accept to accept the terms of the image before the deployment could proceed. I found this to be initially confusing as images like the base Ubuntu image or the GitHub Enterprise Server image did not require this step. After some investigation and a support ticket, Microsoft confirmed this was an undocumented trait of certain images in the Azure Marketplace. Microsoft stated:

The GitHub Enterprise Server offering [sic] is a 1PP product (Core Virtual Machine) and not an Azure Virtual Machine(3PP) which are created by 3PP Publishers in-fact Marketplace Partners. Not all the partners in marketplace are allowed to create the 1PP offer and only few approved Marketplace Partners are allowed to create 1PP VM offers. And in the 1PP marketplace offers will be auto accepted the terms and conditions.

This limits the attack surface of this image confusion attack for Azure, as users would need to accept the terms of the image before deploying it, however many Marketplace images do have the requirement to accept terms before deployment.

The partial search issue

Those of you with keen eyes will notice that the updated image search command for the GitHub example uses grep to filter the publisher of the image and not the --publisher -p argument that exists for the az vm image list command. In fact, the use of the --publisher flag is what many publishers such as F5, AlmaLinux and even at one point Canonical advise their users do to find the latest images for their offerings.

Using only the CLI-provided flags however makes the results still susceptible to the above attack as the --publisher flag, as well as the --offer and --sku flags, are wildcarded by default. This means that if you were to register a publisher with a name that starts with the intended target publisher name, you could still have your image appear in the list of images returned by the command.

This is the reason why the updated GitHub command uses grep to filter the publisher name.

The partial search seems to only be an issue with specifically the az vm image list command. Other commands such as az vm create or az vm image accept-terms do not have this issue and instead seem to directly concatenate the provided publisher, offer and SKU to form the URN. The same seems to be the case for most Terraform plans as the term latest can be used in lieu of a version number to deploy the latest image, negating the need for a search data provider.

Working as intended?

Similar to the official response from AWS, I believe most providers will consider this to be working as intended. The Azure CLI is a tool that is designed to be used by administrators and developers who are expected to have a certain level of knowledge about the resources they are working with and the burden of ensuring the publisher is correct would generally fall on the user.

However, as we have seen with the GitHub example, this can lead to confusion and potential security risks. Azure removing the partial wildcard nature within the az vm image list command would mitigate this risk but this would likely be too much of a breaking change to be considered by the Azure team.

If you liked what I’ve written, or want to hear more on this topic, reach out to me on 𝕏 at @iann0036.

Resource Control Policies: Closing the data perimeter gap

2024-11-17T00:00:00+00:00

It’s pre:Invent season, and one of the most consequential identity and access management features was just released by the identity team at AWS. Resource Control Policies, a strong tool for establishing data perimeters, is now available for organization administrators.

This post explores this new feature, how it helps, what its limits are, and what we might see in the future.

Intro to RCPs

Resource Control Policies, or RCPs, is a feature available in AWS Organizations that allows you to control the maximum permissions allowable to certain resources or resource types for accounts within your organization.

Like Service Control Policies (SCPs), RCPs are permission policies which represent a boundary of maximum permissions that can be applied within an account. This means that RCPs are policies which cannot grant authority for a certain action and can only deny actions from taking place. This makes it a tool that is likely to be used by organizational administrators who wish to establish strong controls for a data perimeter around sensitive resources within their organization.

To put it in other words, whilst an SCP statement could be described as:

despite what the policy on the identity says, the following action is not permitted

An RCP statement could similarly be described as:

despite what the policy on the resource says, the following action is not permitted

Building an effective perimeter

In order to build an effective data perimeter, administrators need to enforce the use of trusted identities, expected networks, and known resources. RCPs assist in enforcing organization-wide compliance with ensuring resources can only be accessed by trusted identities, and only via expected networks. The data perimeter adds an additional coarse-grained layer of protection to the existing practices of fine-grained protections, applied via least privilege role-based access control, network firewalls and resource policies.

Trusted identity enforcement

Let’s take a look at how to apply an RCP to ensure only identities within your organization may access the sensitive resources or data that lies within your accounts. The following policy can be used to ensure that sensitive material from S3, SQS, KMS and Secrets Manager cannot be accessed by identities outside of your organization:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "NoAccessOutsideOrg",
            "Effect": "Deny",
            "Principal": "*",
            "Action": [
                "s3:*",
                "sqs:*",
                "kms:*",
                "secretsmanager:*",
                "sts:*"
            ],
            "Resource": "*",
            "Condition": {
                "StringNotEqualsIfExists": {
                    "aws:PrincipalOrgID": ""
                },
                "BoolIfExists": {
                    "aws:PrincipalIsAWSService": "false"
                }
            }
        }
    ]
}

The effect of the policy is that any API call to these services must originate from an identity within your organization, or be on behalf of an AWS service. Additionally, outside principals cannot use STS to assume an identity within the organization to bypass the block. If a user within the organization attempts to, for example, allow s3:GetObject to an external account via an S3 Bucket Policy, the external account would still be forbidden from accessing objects within the bucket as the RCP will override the allow with its explicit deny.

Those with a keen sense of potential exploits may see the carve out for AWS services and remember the confused deputy problem as a potential problem. Thankfully, RCPs also have an answer to this in the form of enforceable confused deputy protections. We can add the following statement to our RCP to guard against this potential:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnforceConfusedDeputyProtection",
            "Effect": "Deny",
            "Principal": "*",
            "Action": [
                "s3:*",
                "sqs:*",
                "kms:*",
                "secretsmanager:*",
                "sts:*"
            ],
            "Resource": "*",
            "Condition": {
                "StringNotEqualsIfExists": {
                    "aws:SourceOrgID": ""
                },  
                "Null": {
                    "aws:SourceAccount": "false"
                },
                "Bool": {
                    "aws:PrincipalIsAWSService": "true"
                }
            }
        }
    ]
}

The above statement applies specifically when the calling principal is an AWS service, and enforces that the aws:SourceOrgID must be equal to your organization ID (that is, the AWS service is using a principal to access the resource on behalf of another resource that belongs to your organization). The use of aws:SourceAccount is used in the Null condition operator so that the control applies only when the request has the context of an originating account (i.e. is susceptible to the cross-service confused deputy problem).

Expected network enforcement

We can also use RCPs to ensure that access is only granted from expected networks and that data doesn’t traverse through an unexpected network path. The following policy can be used to ensure data from S3, SQS, KMS and Secrets Manager can only be accessed if the caller is within the corporate network:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnforceNetworkPerimeter",
            "Effect": "Deny",
            "Principal": "*",
            "Action": [
                "s3:*",
                "sqs:*",
                "kms:*",
                "secretsmanager:*",
                "sts:*"
            ],
            "Resource": "*",
            "Condition": {
                "NotIpAddressIfExists": {
                    "aws:SourceIp": ""
                },
                "StringNotEqualsIfExists": {
                    "aws:SourceVpc": ""
                },
                "BoolIfExists": {
                    "aws:PrincipalIsAWSService": "false",
                    "aws:ViaAWSService": "false"
                },
                "ArnNotLikeIfExists": {
                    "aws:PrincipalArn": [
                        "arn:aws:iam::*:role/aws:ec2-infrastructure"
                    ]
                }
            }
        }
    ]
}

The effect of the policy is that any attempt to access the resources within these services (or use STS to assume a role to do so) is blocked where the caller’s IP address falls outside the expected CIDR range or originates from a VPC ID that isn’t the expected one. Again, we specifically carve out an exception for AWS services, including those which use forward access sessions. We also have an additional carve out for EBS volume decryption, which uses a known IAM role to call KMS for decryption of the data key for volumes it manages.

A small note that all of the above examples don’t consider OIDC-based identities for readability purposes. Check out the aws-samples repository for a more detailed version which allows for those scenarios.

IAM Access Analyzer

With the introduction of RCPs come additions to IAM Access Analyzer’s External access finding details. Because RCPs have the ability to affect the effective permissions of a call, some of the automated findings may also be rendered invalid. To combat this without outright exposing potentially sensitive details of the RCP itself, the External access finding now has a field which indicates whether or not an RCP may affect a specific finding.

Limitations of RCPs

At launch, RCPs only support actions for S3, SQS, KMS, Secrets Manager and STS. This is a short list of likely the most impactful services for organization administrators to establish a data perimeter for. I’m confident this list will quickly expand based on customer demand.

Unfortunately, RCPs do not allow the use of the * wildcard by itself in the Action field, but instead enforce that all actions need to be scoped to a service namespace. This disallows a kind of automatic opt-in to protections as they become available via RCPs. RCPs also do not support the NotPrincipal element or the NotAction element.

Like SCPs, RCPs also do not apply to the organization management account. Administrators should ensure extra security is applied to this account to compensate. RCPs do however apply to delegated administrator accounts.

RCPs do not apply to services which use service-linked roles, as this would break specific requirements in order for some services to operate correctly. These roles do however fall directly in the AWS side of the Shared Responsibility Model.

Finally, RCPs do have limits and quotas which are very similar to SCPs, including a 5kb policy limit and a limit of 5 policies at each organizational OU, account or root level.

Time to start building

RCPs close a gap in the quest to better protect an organization’s sensitive data through the use of effective data perimeters by giving administrators a new tool to apply these guardrails. This does however introduce another layer of complexity which, if mismanaged, could lead to unexpected consequences such as outages. Administrators should carefully evaluate all the effects of these policies before applying them and in particular investigate specific nuances with how the various AWS services may use differing access mechanisms to reach resources.

Though service support is still limited at launch, I’d encourage administrators to explore the use of RCPs and to start using specific, limited policies to protect resources with known access patterns within their organization.

If you liked what I’ve written, or want to hear more on this topic, reach out to me on 𝕏 at @iann0036.

Poor mans MFA for AWS Client VPN

2024-07-13T00:00:00+00:00

The AWS Client VPN service is a common way to seamlessly connect users into internal networks, however administrators often need ways to ensure a heightened level of security considering the attack surface. In this post, I describe a low-tech, low-cost solution to better authenticate users using a second factor.

Client VPN authentication methods

AWS Client VPN supports connection to federated providers, either via a dedicated Active Directory integration (via AWS Directory Service) or via a SAML provider. These options are good, however often this solution is required either in an environment without federation already established or where the VPN is required on mobile devices, which doesn’t have a supported way to perform the browser-based flow. Because of this, the mutual authentication option is an easy and convenient way to get going quickly and at a low cost.

The Active Directory integration does have the ability to integrate MFA natively, using a RADIUS server, however this typically is a complex setup.

As easy as a thumbs up

The AWS Client VPN service does have the option to provide a client connect handler for the VPN endpoint. This handler is a custom Lambda function you can write to authorize or reject each new connection attempt. Typically, the intent would be to use device posture checks or username lookups from a datastore to evaluate the outcome of the attempt, however we do have a somewhat generous 30 second limit to work with. Notably, this check is in addition to the already established mutual certificate presentation, which takes place before this check is attempted.

A creative alternative solution is to make use of the Slack Bot API to prompt the user to confirm new connections. As users initiate a connection, the Lambda function is invoked and takes the Slack user identifier embedded in the common name of the issued mutual certificate, and uses the Slack Bot API to send a direct message in Slack to the user. The user doesn’t directly respond to the message however, and is instead prompted to give it a thumbs up 👍 reaction. Once the Lambda function sends the initial message, it then short polls the Slack endpoint to retrieve the reactions on its sent message. If it detects the correct reaction before the attempt times out, it responds with a successful authentication attempt.

Here’s what that looks like in practice:

Setting it up

The following assumes you have already set up a Client VPN endpoint using mutual authentication. The AWS docs do a pretty good job at walking you through this. You’ll also need appropriate permissions to install a new bot to your Slack workspace (this is typically allowed for non-administrators).

One modification to the process is to ensure you include the Slack ID of the user in the common name of the issued certificate to clients, like the following:

./easyrsa build-client-full -.mydomain.com nopass

The Slack ID for a user can be found by clicking on the users Slack profile and selecting the “Copy member ID” option in the expand menu.

Next, we’ll set up the Slack Bot itself. To do this, visit https://api.slack.com/apps and click on the “Create New App” button. Use the “From Scratch” option, give your bot a new friendly name, and select the workspace to authorize your bot into.

I highly recommend scrolling down on the initial page and adding an App Icon for your bot to help distinguish it more.

Navigate to the “OAuth & Permissions” page for the bot and scroll to the “Scopes” section. Add the scopes chat:write and reactions:read.

Once done, scroll up and click the “Install to Workspace” button. Authorize the request, navigate back to the “OAuth & Permissions” page and you should have a “Bot User OAuth Token” generated for you, starting with xoxb-.

Take the “Bot User OAuth Token” and save it to the “token” field of a new Secrets Manager secret within your AWS account. I’ve called my secret “myslackbot” here but you can use anything you wish and modify the upcoming script as needed.

The final change is to create the authorization Lambda for the client connection handler. One particularly confusing limitation is that the name of the Lambda function must be prefixed with AWSClientVPN-. Below is the full Python source code for that - no external libraries needed!

import boto3
import os
import json
import pprint
import time
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

def handler(event, context):
    client = boto3.client('secretsmanager')
    secret = json.loads(client.get_secret_value(SecretId='myslackbot')['SecretString'])

    channel = event['common-name'].split("-").pop().split(".")[0]
    if len(channel) < 2 or len(channel) > 12:
        return

    body = {
        'channel': channel,
        'text': 'React with a :thumbsup: to this message to approve the current login attempt from ' + event['public-ip'] + ' (' + event['platform'] + ').\n\nYou must complete this action within 30 seconds.'
    }
    req = Request(
        'https://slack.com/api/chat.postMessage',
        json.dumps(body).encode('utf-8'),
        headers={
            'Content-Type': 'application/json; charset=utf-8',
            'Authorization': 'Bearer ' + secret['token']
        }
    )
    msg = json.loads(urlopen(req).read())

    while True:
        time.sleep(2)
        req = Request(
            'https://slack.com/api/reactions.get?channel=' + msg['channel'] + "×tamp=" + msg['ts'],
            headers={
                'Content-Type': 'application/json; charset=utf-8',
                'Authorization': 'Bearer ' + secret['token']
            }
        )
        reactions = json.loads(urlopen(req).read())

        if 'reactions' in reactions['message']:
            for reaction in reactions['message']['reactions']:
                if '+1' in reaction['name']:
                    return {
                        'allow': True,
                        'error-msg-on-denied-connection': '',
                        'posture-compliance-statuses': [],
                        'schema-version': 'v2'
                    }

Once you’ve configured your client connection handler in the VPN endpoint, you have completed your setup and can test your new MFA solution for yourself.

Finishing up

The above solution was the result of running into a bunch of limitations, but then looking around and considering alternatives that may seem unusual at first however turn out to be quite effective. I’m reminded that this is a good skill to have and can lead to some new experiences that might benefit you in future circumstances.

If you liked what I’ve written, or want to hear more on this topic, reach out to me on 𝕏 at @iann0036.

HTTPS Endpoints and more tricks with AWS Step Functions

2024-01-13T00:00:00+00:00

AWS re:Invent 2023 is now behind us and one of my favourite announcements was the introduction of HTTPS Endpoints to AWS Step Functions. In this post, I explain the feature, test its limits and also show off some other tricks for data manipulation within your state machines.

For the impatient, here is the final result.

HTTPS Endpoints feature

HTTPS endpoints use Amazon EventBridge API destination connections to determine the authentication mechanism used. This service subsequently uses Secrets Manager to store the credentials that will be included to authenticate requests.

Then within the state machine, you reference this connection and specify your own URL and HTTP method. You can also optionally include your own query parameters, headers and/or request body.

There are some limitations though. Firstly, there is a 60 second timeout (hard limit) for the totality of the request. There are additional mandatory headers which Step Functions sets and you cannot override. These are:

Host (value: hostname of the URL)
User-Agent (value: Amazon|StepFunctions|HttpInvoke|us-east-1, where us-east-1 is replaced by your region)
Range (value: bytes=0-262144)

Note that the request will still fail if the response exceeds 256kb even though the Range header is set. The presence of the header can also cause confusion as some servers will respond with a 206 Partial Content status code even if all data is returned, so be aware of that.

The client IP address for the requests are different for each request and appear to lie within the standard EC2 public IP range published by AWS. There is no capability to use Elastic IPs or other networking constructs within your account.

Your state machine IAM role will need to include actions that allow access to the connection and its associated secret, as well as the states:InvokeHTTPEndpoint action which has the optional conditionals of states:HTTPEndpoint and states:HTTPMethod to help scope down what endpoints and HTTP methods the state machine can call. I have included an example of a granular policy in the CloudFormation template at the end of this post.

Gathering the data

In order to demonstrate the capabilities of the new feature, I’ve chosen to consume the Chess.com API. This is a free and anonymous API which retrieves metadata about games and players on their platform.

I will retrieve a list of all grandmasters, their country of origin, and aggregate these details by country.

Because this is a public endpoint, there is no need for an Authorization or similar header when accessing the endpoint, however EventBridge API destinations require the use of Basic Authorization, OAuth or API Key header. One creative way of avoiding sending an unnecessary header is to create your connection using the API Key type but set the header to one of the immutable headers, such as User-Agent.

I created the step to gather the list of grandmasters by hitting the URL https://api.chess.com/pub/titled/GM. Because I am only interested in the content of the response body, I apply an OutputPath filter of $.ResponseBody. This provides me with the list of grandmaster usernames, but not their origin country or actual name. For that, we need to retrieve their details using additional individual HTTPS calls.

To do this efficiently, we use the Distributed Map type within Step Functions. To ensure we do not overload the Chess.com API, we limit the concurrency to 40. We also use a standard exponential backoff for the inner HTTPS call to allow for retries in the event of an occasional error.

This brings us to a state where we have an array of the individual grandmaster details.

Aggregating the data

Aggregating data (using map-reduce style methods) within a state machine is not a native function, however with some clever usage it is possible.

To do this, we first need to ensure all fields are present in the individual grandmaster details. Unfortunately, the name field isn’t always present on these responses so to fix that we add the following ResultSelector to the HTTPS endpoint step within the distributed map:

{
    "output.$": "States.JsonMerge(States.StringToJson('{\"name\":\"Unknown Player\"}'), $.ResponseBody, false)"
}

This takes the resulting detail from the HTTP response, and performs a JSON merge with the static object we defined with a default name. If the name is not present, this field will be used.

Next, we format the resulting name in the way we would like it, as well as extract the 2-letter country code from the URL which looks like https://api.chess.com/pub/country/US. To do this, we use a Pass state. The Parameters of the Pass state are as follows:

{
    "displayName.$": "States.Format('{} ({})', $.output.name, $.output.username)",
    "country.$": "States.ArrayGetItem(States.StringSplit($.output.country, '/'), 4)"
}

Note that the array index used is 4 and not 5. This is because empty segments (the one in between http:/ and the next /) get discarded during the States.StringSplit operation.

Using the output of the distributed map, we apply a new Pass state with the following parameters:

{
    "original.$": "$",
    "countries.$": "States.ArrayUnique($[*].country)",
    "countriesCount.$": "States.ArrayLength(States.ArrayUnique($[*].country))",
    "iterator": 0,
    "output": {}
}

The original key contains the distributed map output, the countries key uses JSONPath and States.ArrayUnique to select the unique list of countries, the countriesCount key is the length of the countries, the iterator key is initialised at 0, and the output key is initialised with an empty map.

Then we enter a loop. The loop will continue whilst the iterator is less than the length of countries. We then use a Pass state to set the country key to the country at the iterator index of the countries list. We then use one more Pass state increase the iterator with:

States.MathAdd($.iterator, 1)

We also set the output key to the following (spaced for visibility):

States.JsonMerge(
    States.StringToJson(
        States.Format(
            '\{"{}":{}\}',
            $.country,
            States.JsonToString(
                $.original[?(@.country == $.country)]['displayName']
            )
        )
    ),
    $.output
, false)

The above performs the following transformations:

Retrieve the list of all displayName strings within the original key, filtering where the country key is equal to the country within the original key entries which we previously created using JSONPath
Convert that list to a JSON string
Create a new JSON-compatible string where the key is the country and the value is the above string-encoded array of names
Convert the string to a JSON object
Merge that object with the output variable

We’re basically adding the country code as a key of the output JSON object one at a time, then increasing the iterator to reference the next country in the list.

Once it has completed the loop, we are left with our final output.

Finishing up

I have provided a CloudFormation template that contains the full state machine and associated connection here. Feel free to deploy this into your own AWS account and try it yourself.

The HTTPS Endpoints feature is a very useful addition to the Step Functions service that I believe will have huge uptake. I personally want to do more with the Step Functions service as I believe more architectures can be more than serverless, they can be “functionless” (i.e. no Lambda functions). I would however like to see more useful intrinsics become available in the service. As you can see from this post, developers are often pushing the limits of what is available. Consider this my #awswishlist item.

A big thank you to Aidan Steele for helping review this post. If you liked what I’ve written, or want to hear more on this topic, reach out to me on 𝕏 at @iann0036.

Swiping right on the AWS WAF CAPTCHA challenge

2023-07-25T00:00:00+00:00

In 2021, AWS WAF introduced a new CAPTCHA feature to help protect sites against bot traffic. The release had some mixed reviews but the idea was that it was an effective protection against programmatic solvers or “bots”.

In this post, I walk through my methodology for beating one of the CAPTCHA challenges presented programmatically. If you’d like to follow along, you can try the CAPTCHA challenges yourself here.

The AWS WAF CAPTCHA system

The CAPTCHA feature in AWS WAF is an optional action as a result of a match against customer-defined rules. It is intended to be an option to help bridge the difficult decision of a hard deny or hard allow when client heuristics may appear suspicious but not outright bot-like.

When triggered, the action prompts viewers of a website with interactive challenges designed to test that a human viewer is real and block bots seeking to crawl or disrupt human traffic. At launch, and to this day, there are two challenges available which I will call the “car maze” and “shape match” challenges.

I created a Twitter (𝕏?) thread about beating the car maze challenge when it was originally released which you can read here:

Had a bit of fun today with the WAF CAPTCHA thing. The car maze turned into a fun programming challenge! 1/ pic.twitter.com/D6Rf4SZGy4
— Ian Mckay (@iann0036) November 14, 2021

I will note that there have been some changes since writing the thread and discussing my findings with the AWS WAF service team that make the car maze challenge slightly more complex, though the same concepts still broadly apply.

Let’s go through the same process with the shape match challenge!

Shape matching

The shape match challenge features an image of 5 random 3D shapes lined up horizontally which has been split across the vertical axis and reordered. The interface gives you a slider which you can move to match usually only one shape at a time and gives you instructions as to which shape to match up and submit. The bottom section wraps as you drag the slider.

The available shapes are: ball, cone, cube, cylinder, donut, knot and pyramid.

The challenge presents both halves of the shapes as a single JPEG image, always at a 320x160 resolution. Taking a similar approach as the car maze solve, I’m using HTML canvas to inspect the image, extract pixel data and draw for my own visualization. For my first step, I sample the top-left pixel colour and eliminate these pixels from consideration. Because the challenge is a JPEG, some colour blending and artifacts are present so in most of the below steps I check for colour closeness by ensuring the RGB channels are within a small boundary (in this case, no more than 7 away). The top and bottom 80 pixels of the Y-axis represent the top and bottom sections, respectively.

I now want to identify the location and width of the shapes at the midline for the top and bottom sections. The shapes in the challenge always have a clear separation between them, so in order to do this I move left-to-right at just above and below the midline (skipping the exact pixels on the midline, as JPEG artifacting can sometimes merge the pixels at y=79 and y=80). When I hit a non-background pixel, I mark the starting point of the shape. Once I hit a background pixel again, I can presume the start and stop points on the X-axis.

This gives me a set of values which intersect at the midline, however there are typically more values than the 5 shapes that are present. This is because shapes like the donut and knot intersect the midline at multiple points. To overcome this, we need to find any space in between where the shapes hit the midline where there isn’t a clear path to the relative extremes of the axis (i.e. where it is presumed to be in the center of the donut / knot). We take the middle of each of the clear spaces and start drawing a line towards the extreme of the axis, allowing a deviation to the left or right if clear space is present. Any line that does not reach the axis extreme is considered to be within the shapes, so these points are aggregated with regard to the shape boundary at the midline. This finally provides us with 5 positions and widths for both the top and bottom sections.

Because the donut always has two midline points which are of roughly equal width, we can mark this as a high probability match straight away. Additionally, if we see a single shape with more than 2 midline point intersections we can safely assume it is of the knot as this is the only shape that does this. At this point, I can start drawing the resulting shapes on individual canvases and mark those which are assumed during development.

We can then use the widths of the top and bottom shape midline intersections and find roughly matching widths. This gives us strong candidates for matching top and bottom section shapes, allowing us to calculate the relative X-axis offset needed to create the shapes. Under good circumstances, we now have 5 completed shapes but no way of identifying at least 3 of them.

In order to discover more information about the potential shapes, we calculate more landmark points to gain additional heuristics on the shape type. These points are calculated by the following:

Point 1: From the extreme left side at the midline, move towards the Y-axis extreme
Point 2: From the extreme right side at the midline, move towards the Y-axis extreme
Point 3: From the X-axis center at the midline, move towards the Y-axis extreme - if blocked, deviate left if able
Point 4: From the X-axis center at the midline, move towards the Y-axis extreme - if blocked, deviate right if able

Here are the paths that discovery takes to find the landmark points:

A ball shape always has a short Y-axis travel for points 1 and 2 for both sections, as well as a short X-axis travel from the center of the midline for points 3 and 4. The Y-axis travel for points 3 and 4 are generally identical and have roughly the same value as the X-axis travel for points 1 and 2.

A cone or pyramid shape typically also has a short Y-axis travel for points 1 and 2 in the top section, but a large Y-axis travel for all points in the bottom section.

A cube or cylinder generally has a roughly matching X-axis and Y-axis for the diametrically opposing points (point 1 in the top and point 2 in the bottom, and vice-versa).

Although it is challenging to decide between a cone/pyramid and cube/cylinder due to their shape similarities, there is one more trick we can use. Taking a path across the X-axis just below the midline, track the colours during movement. If the colour always gradually changes slightly, we can assume there is a gradient and the shape is a cone or cylinder. If there is exactly one or two colours, these represent the visible faces of a pyramid or cube.

We’ve now successfully identified each shape and their offsets.

Solving the challenge

The challenge generally accepts an offset value as its answer and so without any UI interference we could simply respond with a network request programmatically. However, I wanted to see the actual solution occur so I looked into actually performing the sliding action.

I had never programmatically moved a slider before and it turns out it is actually a rare automation to achieve, but it is possible. I came across this StackOverflow answer which showed I can create custom mousedown, mousemove and mouseup Mouse Events which worked in order to drag the slider. Notably, there was some math required to slide to the correct position, as the image width was 320 pixels, the slider would drag a maximum of 274 pixels, and the challenge solution endpoint accepted an answer between 0 and 255.

Occasionally, identification would fail due to an edge case or similar, however this simply meant that a new challenge would load and the automation could try again immediately. There seems to be no lockout or escalation of difficulty.

The road not travelled

There were a few approaches I could have taken during the development of this solution, however I took what I thought was the simplest and easiest to understand solution. I did look into using the JavaScript version of OpenCV, which I could pretty easily use to find the contours of the shapes and I could have used this to assist with some edge case resolution.

Additionally, the audio-based accessibility CAPTCHA alternative still remains for those in the speech recognition space looking for a fun challenge.

Final thoughts

The AWS WAF CAPTCHA remains an effective deterrent for all but the most determined of bot authors. I don’t envy the position the AWS WAF service team members are in. They are charged with creating a novel, interactive CAPTCHA challenge that has little cognitive load for users but remains challenging enough that it isn’t easily toppled by bots. I believe that if there were a constantly evolving rotation of new WAF challenge types we would have an effective protection purely based on the bot authors ability to adapt. Sadly this hasn’t yet happened. Features like Bot Control seem to be a far more effective way of dealing with bot traffic without generally affecting users, so I’d recommend that instead.

If you liked what I’ve written, or want to hear more on this topic, reach out to me on Twitter (or whatever it’s called now) at @iann0036.

Cedar: Avoiding the cracks

2023-07-06T00:00:00+00:00

With the open-source release of the Cedar engine and the general availability release of Amazon Verified Permissions, more and more engineers are considering integrating Cedar into their own systems for authorization, but what do policy authors need to consider to avoid unexpected outcomes?

In this post, I’ll walk through my experiences in where policy authoring can go wrong and the steps you can take to overcome these issues. This post will walk through some advanced evaluation scenarios, so if you’re new to the Cedar language I highly recommend you first read my introductory post on the topic, Cedar: A new policy language.

Non-unique entity identifiers

Though I mentioned it in my previous post, it’s important to always use unique identifiers for entities to ensure they do not get re-used in the future. The reason this may be a problem is that a reliance may start to occur on the entity, the entity goes away at some point in time, then a new entity of the same name comes into existence at a later point.

For example, consider the following statement:

permit(
    principal == User::"John",
    action,
    resource == Account::"Corporate"
);

If the user named John leaves the company, and then another John joins the company and happens to take the same entity identifier, it’s possible for the new John to inherit some privileges he should not be entitled to. The Cedarland blog has some more detail on the reasoning behind this.

Solutions

Always use unique identifiers, such as the identifiers your IdP provider uses, to uniquely identify principals. Additionally, use resource identifiers which are also unique for the context provided. Comments and annotations can help you keep track of identifiers where necessary.

permit(
    principal == User::"9a6afab1-5a37-4c90-aa40-24277b93ca28", // John Smith
    action,
    resource == Account::"710f18bc-b8ab-4313-b362-8e6264cfcf91" // Corporate Account
);

Invalid statements

Invalid statements not being evaluated is in my opinion one of the easiest ways to get an unexpected result from your policy evaluations. Consider the following policy:

permit(
    principal,
    action == Action::"Connect",
    resource
);

forbid(
    principal,
    action == Action::"Connect",
    resource == Endpoint::"AdminEndpoint"
) unless {
    context.viaAdminNetwork == true
};

The intention behind the policy is to allow connections to all endpoints except the admin endpoint unless the context object has the viaAdminNetwork key set to true. Unfortunately, the implementation of the context object in this example is that the viaAdminNetwork key is omitted, not false, if the call does not come from the admin network.

The result of this is that the forbid statement is not processed as there is an evaluation error due to the missing key. However, as the permit statement has been evaluated, and there are no other valid forbid statements, the result is an allow of the call. Even though the evaluated result is allow, there will be errors in the diagnostic return, as you can see from this Cedar playground screenshot:

There is more discussion on the reasoning for this behaviour over at the Cedlarland blog.

Solutions

Cedar has a validation engine that uses a schema to define the properties of entities within your system. This allows Cedar to warn you during the authoring phase when policies may not be valid. It is a best practice that you always construct a schema for your system.

The following schema would allow a developer to catch the unsafe usage of the attribute:

{
    "": {
        "entityTypes": {
            "Endpoint": {
                "shape": {
                    "type": "Record",
                    "attributes": {}
                }
            }
        },
        "actions": {
            "Connect": {
                "appliesTo": {
                    "resourceTypes": ["Endpoint"],
                    "context": {
                        "type": "Record",
                        "attributes": {
                            "viaAdminNetwork": { "type": "Boolean", "required": false }
                        }
                    }
                }
            }
        }
    }
}

Where possible, the inputs provided by the context object should be predictable. The developer may consider always setting the viaAdminNetwork key to simplify.

Alternatively, we can also modify the policy to test for the presence of the key itself, as shown:

permit(
    principal,
    action,
    resource
);

forbid(
    principal,
    action,
    resource
) unless {
    context has "viaAdminNetwork" && context.viaAdminNetwork == true
};

Developers might also consider overriding an allow result if any evaluation errors are present in the evaluation response, if that outcome is more desirable.

Dangers of short-circuiting

Short-circuiting is a performance feature of the Cedar language which allows it to skip evaluation of specific expressions that should not affect the result of the policy evaluation. It is present under the following conditions:

expression1 && expression2: expression2 is not evaluated when expression1 is false
expression1 || expression2: expression2 is not evaluated when expression1 is true
if expression1 then expression2 else expression3: expression2 is not evaluated when expression1 is false and expression3 is not evaluated when expression1 is true

This is typically a good thing, however it will not produce an error due to an invalid expression unless it actually evaluates that expression. For example, consider the below policy:

permit (
    principal,
    action == Action::"login",
    resource
)
when { context.isPrimarySite == true || principal.isBreakGlasEntity == true };

Note that this policy has the typo isBreakGlasEntity, which is missing an ‘s’. The intention behind the policy is that the login action is permitted only when accessing from the primary site under normal conditions, or if the principal is a special “break glass” entity under any conditions. This policy works under normal conditions, but due to the typo will error and not permit the break glass entity when they are most needed.

Solutions

A Cedar schema should again be used to determine the valid entity attributes during the entity modelling process and warn of inconsistencies during the policy authoring phase.

The following Cedar schema should be used to help find the typo during the authoring time of the policy:

{
    "": {
        "entityTypes": {
            "User": {
                "shape": {
                    "type": "Record",
                    "attributes": {
                        "isBreakGlassEntity": { "type": "Boolean", "required": true }
                    }
                }
            }
        },
        "actions": {
            "login": {
                "appliesTo": {
                    "principalTypes": [ "User" ],
                    "context": {
                        "type": "Record",
                        "attributes": {
                            "isPrimarySite": { "type": "Boolean", "required": true }
                        }
                    }
                }
            }
        }
    }
}

In addition to schema validation, it is also important to perform positive and negative testing against your policies (in a local or non-production environment) to ensure the policies will act in the way you expect for critical paths.

Ambiguous entity type

When writing condition statements which interact with an entity store, entities don’t have an inherit type associated with them. Consider the following entity store:

[
  {
    "uid": "User::\"alice\"",
    "attrs": {
      "active": true
    }
  },
  {
    "uid": "Action::\"redeemValidTicket\""
  },
  {
    "uid": "Ticket::\"someTicketID\"",
    "attrs": {
      "active": false
    }
  }
]

and the policy:

permit (
    principal,
    action == Action::"redeemValidTicket",
    resource
)
when { resource.active == true };

The intention behind this is to allow ticketholders redeem active tickets. The implementing developer allowed the full resource entity ID ("Ticket::\"someTicketID\"") be passed in as the resource input. Alice can’t redeem the "Ticket::\"someTicketID\"" resource as it is marked as not active, however Alice can perform a successful redemption with the resource entity ID "User::\"alice\"". Even though her user active attribute was never intended for that purpose, it nonetheless can lead to an unexpected allow.

Solutions

The developer could enforce that the “Ticket::” prefix is used (or perform the concatenation themselves).

The entity store could be modified to provide a unique attribute that the policy could match on using the has operator (resource has "ticketIssueDate").

The entity store could also be modified to place tickets in a new entity type “TicketGroup” using the parents construct and enforce via policy that the resource is within this group (resource in TicketGroup::"IssuedTickets").

Additionally, there is also a pending RFC that is discussing introducing an is operator to perform entity matching.

Unexpected order of operations

Like other languages, Cedar has a de-facto order of operations due to the way the grammar is constructed. This means that operations such as math works as you would expect:

permit (
    principal,
    action,
    resource
)
when { 1 + 2 * 3 + 4 * 5 == 27 }; // always true

It’s important to read and understand the grammar before constructing complex and ambiguous policies to avoid unintended effects. Consider the below policy:

permit (
    principal,
    action,
    resource
)
when {
    if resource.owner == principal then true else false &&
    resource.isRestricted == false
};

The intention behind the policy is to allow access when the principal is the resource owner and the resource is not restricted, however the effect of the policy is that a principal who is the resource owner is permitted access even when the resource is marked as restricted.

This is because the order of operations for an if-then-else operation is higher than that of the && operation and so the evaluation of the above condition is intrinsically like so:

if (resource.owner == principal) then (true) else (false && resource.isRestricted)

Solutions

Read the grammar when in doubt of the order of operations.

If you are ever in doubt, or simply want to be more explicit, use parentheses to explicitly show the intended grouping of operations:

permit (
    principal,
    action,
    resource
)
when {
    (if resource.owner == principal then true else false) &&
    resource.isRestricted == false
};

Side channels

Issues can often arise from the specific implementation that surrounds the use of Cedar, whether via Amazon Verified Permissions or a direct engine implementation. The engine can only evaluate against the inputs you have provided and if those inputs are not sanitized or invalid, it can lead to a compromise.

Late last year, the popular json5 library released a security advisory regarding the potential for prototype pollution. If you were to allow a user to specify their own context object, but override certain keys which were used in sensitive operations, an attacker could use this vulnerability to manipulate the inputs the Cedar engine receives.

// userInput = '{"foo": "bar", "__proto__": {"isAdmin": true}}'

const ctx = JSON5.parse(userInput);
if (secCheckKeysSet(ctx, ['isAdmin', 'isMod'])) {
  throw new Error('Forbidden...');
}

return avpclient.isAuthorized({
  'context': ctx,
  ...
});

Solutions

As always, a healthy supply-chain security program is recommended for organizations who make heavy use of external libraries. Input sanitization is also an important step to ensure that the engine can make appropriate authorization decisions.

As more and more built-in integrations become available, take advantage of these to shift more of the burden outside of your responsibility and avoid side-channel issues.

Wrapping up

As new language bindings, AWS integrations, external integrations, and even changes to the Cedar language itself continue to be produced, the overall community and ecosystem is growing. The scenarios above highlight the importance of a solid understanding of the language, but also solutions to help you overcome these hurdles and scale your authorization logic faster than would otherwise be possible.

If you liked what I’ve written, or want to hear more on this topic, reach out to me on Twitter at @iann0036. You can also join the discussion over at the official Cedar Slack workspace.

Exploring Amazon VPC Lattice

2023-04-01T00:00:00+00:00

(yes, that is a picture of my breakfast)

Today, AWS has released Amazon VPC Lattice to General Availability. This post walks through creating a simple VPC Lattice service using CloudFormation, and takes a look at the service overall.

VPC Lattice was my #1 favourite announcement of AWS re:Invent 2022, so I’m excited to see it released today. As of the time of writing, it’s available in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), and Europe (Ireland).

How it works

VPC Lattice is a service that enables you to connect clients to services within a VPC. It is very similar to AWS PrivateLink (also known as private VPC Endpoints), but with a key difference.

Whilst PrivateLink works by placing Elastic Network Interfaces within your subnet, which your clients can hit to tunnel network traffic through to the destination service, VPC Lattice works by exposing endpoints as link-local addresses. Link-local addresses are (generally) only accessible by software that runs on the client instance itself.

AWS has carved out the range 169.254.171.0/24 for VPC Lattice’s use, typically routing directly to 169.254.171.0 (there’s also an IPv6 equivalent). This is not the first network that AWS exposes via link-local addresses. You may know of:

EC2’s Instance Metadata Service, which is located at 169.254.169.254
Route 53’s DNS Resolver, which is located at 169.254.169.253
ECS’s Task Metadata Endpoint, which is located at 169.254.170.2
Amazon Time Sync Service (NTP), which is located at 169.254.169.123

Generally, these endpoints are automatically available to clients within the VPC network without any special routing or security rules. VPC Lattice differs from this slightly, as it requires Security Groups and NACLs to allow traffic to and from the VPC Lattice data plane at 169.254.171.0/24 on whichever port the destination service exposes. I was pretty surprised by this requirement when I saw it as it’s the first link-local address to need this, but it does give network administrators some basic control. Generally, it’s advised to use a managed prefix list instead of the exact range above, as it’s subject to change.

Targets which VPC Lattice connects to closely match that of load balancing target groups, including EC2 instances, VPC IP addresses (both IPv4 and IPv6), Lambda functions, and ALBs. An EKS-specific target type is in private beta as of the time of writing.

A walkthrough

For this walkthrough, we’ll discuss the various components needed for a VPC Lattice setup. For simplicity, we’ll be creating a Lambda function as a client (initiates a HTTPS request), and another Lambda function as a server (responds to the HTTPS request). If you want to skip ahead, here’s the completed template.

Let’s begin by creating a basic VPC. The VPC will have two private subnets, but we won’t add any direct routing between them. For simplicity, we’ll also skip adding Network ACLs.

Resources:

  # Basic VPC

  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true

  PrivateSubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      CidrBlock: 10.0.0.0/24
      MapPublicIpOnLaunch: false
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: Private Subnet (Source Subnet)
      AvailabilityZone: !Select
        - 0
        - Fn::GetAZs: !Ref AWS::Region

  PrivateSubnet2:
    Type: AWS::EC2::Subnet
    Properties:
      CidrBlock: 10.0.1.0/24
      MapPublicIpOnLaunch: false
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: Private Subnet (Destination Subnet)
      AvailabilityZone: !Select
        - 1
        - Fn::GetAZs: !Ref AWS::Region

  RouteTablePrivate1:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: Private Route Table (Source Subnet)

  RouteTablePrivate1Association1:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref RouteTablePrivate1
      SubnetId: !Ref PrivateSubnet1

  RouteTablePrivate2:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: Private Route Table (Destination Subnet)

  RouteTablePrivate2Association1:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref RouteTablePrivate2
      SubnetId: !Ref PrivateSubnet2

Next, we’ll create the service itself. The service will be a Lambda function which performs a basic successful response to any requests, whilst including it’s own event payload in its response body. The function will be within the second private subnet within the VPC, and its security group will only have a single inbound rule from the VPC Lattice service on the port in which it serves.

  # Inbound Lambda (Service)

  InboundLambdaFunctionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Action: sts:AssumeRole
            Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
      Policies:
        - PolicyName: root
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                  - xray:PutTraceSegments
                  - xray:PutTelemetryRecords
                  - ec2:CreateNetworkInterface
                  - ec2:DescribeNetworkInterfaces
                  - ec2:DeleteNetworkInterface
                Resource: '*'

  InboundLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt InboundLambdaFunctionRole.Arn
      TracingConfig:
        Mode: Active
      Runtime: python3.9
      Timeout: 10
      Code:
        ZipFile: |
          import os
          import json
          import http.client

          def handler(event, context):
            print(event)
            return {
              "statusCode": 200,
              "body": json.dumps({
                "success": "true",
                "capturedEvent": event
              }),
              "headers": {
                "Content-Type": "application/json"
              }
            }
      VpcConfig:
        SecurityGroupIds:
          - !Ref InboundLambdaFunctionSecurityGroup
        SubnetIds:
          - !Ref PrivateSubnet2

  InboundLambdaFunctionSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for InboundLambdaFunction
      VpcId: !Ref VPC
      SecurityGroupEgress: []
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 169.254.171.0/24 # should be the prefix list instead, this'll work though
      GroupName: demo-inboundsg

Next up, we’ll create the components of the VPC Lattice service itself. This includes:

The service network
A security group which controls which clients may access the service network
The service we are creating
A listener for the service (HTTPS on port 443)
A target group for the listener to point to, with an initial target of the previously created Lambda function

To keep things simple, we’re not adding an auth policy for the service network or the service itself.

  # VPC Lattice

  VPCLatticeServiceNetwork:
    Type: AWS::VpcLattice::ServiceNetwork
    Properties:
      Name: demo-servicenetwork
      AuthType: NONE

  VPCLatticeServiceNetworkSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for service network access
      VpcId: !Ref VPC
      SecurityGroupEgress: []
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: !GetAtt VPC.CidrBlock
      GroupName: demo-servicenetworksg

  VPCLatticeServiceNetworkVPCAssociation:
    Type: AWS::VpcLattice::ServiceNetworkVpcAssociation
    Properties:
      SecurityGroupIds:
        - !Ref VPCLatticeServiceNetworkSecurityGroup
      ServiceNetworkIdentifier: !Ref VPCLatticeServiceNetwork
      VpcIdentifier: !Ref VPC

  VPCLatticeService:
    Type: AWS::VpcLattice::Service
    Properties:
      Name: demo-service
      AuthType: NONE

  VPCLatticeServiceNetworkServiceAssociation:
    Type: AWS::VpcLattice::ServiceNetworkServiceAssociation
    Properties:
      ServiceNetworkIdentifier: !Ref VPCLatticeServiceNetwork
      ServiceIdentifier: !Ref VPCLatticeService

  VPCLatticeListener:
    Type: AWS::VpcLattice::Listener
    Properties:
      Name: demo-listener
      Port: 443
      Protocol: HTTPS
      ServiceIdentifier: !Ref VPCLatticeService
      DefaultAction:
        Forward:
          TargetGroups:
            - TargetGroupIdentifier: !Ref VPCLatticeTargetGroup
              Weight: 100

  VPCLatticeTargetGroup:
    Type: AWS::VpcLattice::TargetGroup
    Properties:
      Name: demo-targetgroup
      Type: LAMBDA
      Targets:
        - Id: !GetAtt InboundLambdaFunction.Arn

It’s important to note that by associating the service network to the VPC, there are routes created within the VPCs route table to correctly send traffic destined towards 169.254.171.0/24 to the VPC Lattice service.

The target group also automatically adds a resource-based policy statement to the Lambda function for you (some other services require you to explicitly add an AWS::Lambda::Permission).

Finally, we’ll create the client which will send requests to the VPC Lattice service. Again, this will be driven via a basic Lambda function. Note that this time, the security group requires an outbound rule towards the VPC Lattice service.

  # Outbound Lambda (Client)

  OutboundLambdaFunctionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Action: sts:AssumeRole
            Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
      Policies:
        - PolicyName: root
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                  - xray:PutTraceSegments
                  - xray:PutTelemetryRecords
                  - ec2:CreateNetworkInterface
                  - ec2:DescribeNetworkInterfaces
                  - ec2:DeleteNetworkInterface
                Resource: '*'

  OutboundLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt OutboundLambdaFunctionRole.Arn
      TracingConfig:
        Mode: Active
      Runtime: python3.9
      Environment:
        Variables:
          ENDPOINT: !GetAtt VPCLatticeServiceNetworkServiceAssociation.DnsEntry.DomainName
      Timeout: 10
      Code:
        ZipFile: |
          import os
          import json
          import http.client

          def handler(event, context):
            conn = http.client.HTTPSConnection(os.environ["ENDPOINT"])

            conn.request("POST", "/", json.dumps(event), {
              "Content-Type": 'application/json'
            })
            res = conn.getresponse()
            data = res.read()

            print(data.decode("utf-8"))
      VpcConfig:
        SecurityGroupIds:
          - !Ref OutboundLambdaFunctionSecurityGroup
        SubnetIds:
          - !Ref PrivateSubnet1

  OutboundLambdaFunctionSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for OutboundLambdaFunction
      VpcId: !Ref VPC
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 169.254.171.0/24 # should be the prefix list instead, this'll work though
      SecurityGroupIngress: []
      GroupName: demo-outboundsg

Now that our template is done, we can deploy it via CloudFormation. If you got stuck anywhere, try the pre-made version here.

Once deployed, navigate to the Lambda console and find the function named something similar to “OutboundLambdaFunction”. Create a test event using any JSON object and invoke it. You should see the results from the service come back to you by observing the logs.

A note on pricing

It’s worth noting that the pricing model for VPC Lattice is different to that of PrivateLink and will probably end up costing you more overall. For N. Virginia, a PrivateLink service costs $0.01/hour per availability zone, plus $0.01/GB with volume discounts. For the same region, a VPC Lattice service costs $0.025/hour regardless of AZs, plus $0.025/GB with no volume discounts, plus $0.10 per million requests (with the first 300k requests per hour free).

Wrapping up

I’m interested to see how architectures will evolve with this new technology. Whilst PrivateLink remains more affordable and already widespread, I can see architects reaching for this new technology to improve their security posture and reduce the load on networking engineers.

If you liked what I’ve written, or want to hear more on this topic, reach out to me on Twitter at @iann0036.

Cedar: A new policy language

2023-01-11T00:00:00+00:00

Cedar is a new language created by AWS to define access permissions using policies, similar to the way IAM policies work today. In this post, we’ll look at why this language was created, how to author the policies, and some additional features of the language. The language was designed by the Amazon automated reasoning team for use in new services such as Amazon Verified Permissions, AWS Verified Access and likely other future services and integrations.

Why write a new language?

IAM policies, introduced over 11 years ago, have been integrated into the AWS ecosystem as the fundamental way to control both human and system access to AWS resources. IAM policies are highly optimized for AWS and have constructs (like ARNs) which make it not suitable for usage on principals and resources outside of AWS.

Cedar is a generalist language which has no implicit AWS constructs within it, and this allows it to be used as an authorization engine for non-AWS applications. This is why it’s used at the core of the Amazon Verified Permissions service, where AWS manages the policy dataset and allows systems to directly make authorization calls against the evaluation engine. Incidentally, the name “Cedar” was coined as a follow on from the internal policy language of IAM, “Balsa”.

Cedar is written in Rust, which makes it run in milliseconds, and was designed to be simple to reason about the effect of policies. For example, it allows for the creation of tooling which takes two policies and determines whether they are exactly equivalent, or whether there are authorization requests that would differ in the result when evaluated against each policy.

How it works

The policy evaluation engine for the Cedar language takes one or more policies, and evaluates whether a requested action is permitted or forbidden (allowed or denied). Cedar requires the principal making the request, the action being taken, the resource being accessed, and optionally additional request context at the time of the authorization call. Cedar also consumes the policies to be evaluated and may also use a list of entities (principals, actions and resources) that exist within your application, however these may be provided ahead of time or indirectly depending upon the service integration.

The request context object may be set by the requesting application or, in the case of AWS Verified Access, defined by the service.

Cedar has a playground which allows you to play with the engine itself. It is also currently integrated into the Amazon Verified Permissions and AWS Verified Access services. As of the time of writing, Cedar is not available as an open-source or otherwise downloadable library.

Syntax

A typical Cedar policy statement looks like the following:

permit(
    principal == User::"John",
    action == Action::"view",
    resource
)
when {
    resource in Folder::"John's Stuff" &&
    context.authenticated == true
};

A policy can contain a number of statements by simply appending them onto the policy document. The syntax is not whitespace dependent and may be compressed into a single line. Typically, principals and resources should use immutable identifiers and not names. The examples in this post use simple names for readability purposes only.

The policy contains the following parts:

The effect, which will always be either permit or forbid
The scope, which specifies the principals, actions, and resources to which the effect applies
Optionally, condition clauses, which may either be a when or an unless condition

Entities (principals, actions or resources) will always follow the format TypeOfEntity::"UniqueIdentifier". The type of entity may be further namespaced, for example, Company::Account::Department::Person::"John".

Entity types are ambiguous and not determined by their namespace. This means a single entity can be either a principal, action or resource, depending upon the specific context. The only exception is that actions must have their rightmost namespace use the keyword Action (i.e. Action::"MyAction", CustomNamespace::Action::"MyAction").

Evaluation logic

When evaluating a request, Cedar will consider all statements within the policy, and in the case of Amazon Verified Permissions, all policies provided in a policy store (as if it were one big policy). If any forbid statement matches the request, the request will be denied, regardless of any permit statements. If at least one permit statement matches the request (and no forbid statements match), the request will be allowed. If no statements match, the request will be implicitly denied.

If you’ve worked with AWS IAM, you’ll recognize Cedar’s policy evaluation logic is the same. This also means that ordering of statements in a policy is irrelevant and has no effect on the outcome of an authorization request.

Because forbid statements are applied universally without the ability to override, they are commonly used to craft guardrails across the entire policy store.

The scope

The scope is written in a way that almost looks like a set of arguments in a function. It always consists of the keywords principal, action and resource. Each of these keywords may optionally be followed by either an == Some::"Entity" or an in Some::"Group" to scope down the principals, actions or resources in which the statement applies to. In addition, an inline set in the form in [ Some::"Entity", SomeOther::"Entity", ... ] can be used for the action keyword only. When no keywords have this suffix, the policy applies to all requests, so long as the conditions are met.

The scope is generally used for role-based access control, where you would like to apply policies scoped to a specific defined or set of resources, actions, principals, or combination thereof.

Condition clauses

Condition clauses further limit whether a policy takes effect for the specific request. Typically policy statements will either have no condition clauses or one condition clause, however the syntax does allow for any number of condition clauses to form a statement.

Condition clauses are more flexible than the scope, featuring a basic set of operators to allow you to form a boolean result of acceptance based off of the principal, action, resource or context of the request, as well as the attributes or nested hierarchy of these entities where a list of entities has been defined. The use of logical operators such as && and || allow you to form long, complex conditions to match your specific requirements. The like operator allows you to perform string matching with the use of a * wildcard character.

Condition clauses are intended to perform attribute-based access control. Though it is possible to include scope conditions within a condition clause, exactly the way you would in the scope, it’s recommended that you retain those scope conditions in the scope for both readability and performance reasons.

Additional language features

Using the above syntax is all you need to start writing basic statements to permit or forbid access to your application, however there are some more features of the language which we’ll go through. Some of these features may not be available or useful depending upon the service in which Cedar is integrated into.

Comments

Policies may contain the // operator to add comments, which are particularly useful for indicating an abstract identifier, for example:

// the following was added by the accounts team
// it was approved by Jane Doe
permit(
    principal == User::"9a6afab1-5a37-4c90-aa40-24277b93ca28", // John Smith
    action,
    resource == Account::"710f18bc-b8ab-4313-b362-8e6264cfcf91" // MyCorp Dev Account
);

Entities

Cedar supports accepting a list of known entities (resources, actions or principals) within a system. This is helpful as you may author policies which interact with the hierarchy or attributes of the entities within condition clauses. When an authorization request is made, the principal, action and resource identifiers will correlate to the defined entity of the same identifier when present in the entity list.

The structure of the entity list differs from service to service. In the Cedar playground, the entity list looks like the following:

[
  {
    "uid": "User::\"john\"",
    "parents": [
      "UserGroup::\"Staff\""
    ],
    "attrs": {
      "department": "Hardware Engineering",
      "age": 30
    }
  },
  {
    "uid": "UserGroup::\"Staff\""
  }
]

In Amazon Verified Permissions (for an IsAuthorized call), the same entity list would look like this:

[
  {
    "EntityId": {
      "EntityType": "User",
      "EntityId": "john"
    },
    "Parents": [
      {
        "EntityType": "UserGroup",
        "EntityId": "Staff"
      }
    ],
    "Attributes": {
      "department": {
        "String": "Hardware Engineering"
      },
      "age": {
        "Long": 30
      }
    }
  },
  {
    "EntityId": {
      "EntityType": "UserGroup",
      "EntityId": "Staff"
    }
  }
]

We can use the known attributes in the entity to construct policies that permit or forbid access. For example:

permit(
    principal,
    action == Action::"Access",
    resource == Room::"Drinks Lounge"
) when {
    principal.age >= 18
};

This policy allows access only when the principal has the attribute “age”, and its value is equal to or greater than the number 18. If the age attribute wasn’t set, or the principal wasn’t defined at all in the entities list, this statement wouldn’t permit access.

The entities can also have the concept of a hierarchy, at any nesting level, to act based on this. For example:

permit(
    principal,
    action == Action::"Access",
    resource == Room::"Common Area"
) when {
    principal in UserGroup::"Staff"
};

This policy allows any entity which has a parent of the UserGroup::"Staff" entity access. Once again, if the entity isn’t defined or isn’t a child of UserGroup::"Staff", this statement wouldn’t permit access. The in operator applies to both direct children, as well as all descendants of those children. Additionally, the in operator also applies to the referenced parent, i.e. if the principal was UserGroup::"Staff" in the above example the policy would permit access.

Extensions

In addition to the base data types of strings, booleans, integers and sets/arrays, Cedar supports the additional data types of IP addresses, and decimals. These two data types can only be declared using a function call-like syntax, and can only be operated on using their in-built methods. These data types are known as extensions.

In the case of IP addresses, the syntax looks like the following:

permit(
    principal,
    action,
    resource
) when {
    ip(context.client_ip).isInRange("10.0.0.0/8")
};

The IP address type is created using the ip(...) syntax, and calls the isInRange(...) function to return a boolean. A similar effect is seen for the use of the decimal types:

forbid(
    principal,
    action,
    resource
) when {
    decimal(context.risk_score).greaterThan(decimal("7.2"))
};

Because Cedar does not allow any floating point types to be passed in, inputs must be in the form of a string (i.e. “8.24”). Decimal supports up to 4 digits after the decimal point.

Both extensions have a number of other methods available, all of which currently return a boolean result.

Policy templates

Policy templates is a Cedar feature useful for applying a common policy to a large group of principals or resources. A policy template allows you to add a variable substitution to the equality operators in the scope block for the principal and/or resource keywords. A policy template by itself is not effective, but allows policies to be created by simply providing the variable values instead of duplicating the full syntax. Policies generated from policy templates will automatically update if a policy template changes. A policy template may look like this:

permit(
    principal == ?principal,
    action == Action::"download",
    resource in ?resource
) when {
    context.mfa == true
};

The ?principal and ?resource keywords represent the variables that may be substituted. A policy created from this template would allow the principal to download all children of the resource when accessing using MFA.

Examples

The following is a set of examples to help you get started and understand the language.

Allow all

Policy:

permit(
    principal,
    action,
    resource
);

This statement permits all requests. It may be restricted by forbid statements elsewhere in the policy set.

Deny all

Policy:

forbid(
    principal,
    action,
    resource
);

This statement forbids all requests. It cannot be overridden and renders all other statements in the policy set useless.

Specific RBAC policy

Policy:

permit(
    principal == Customer::"John",
    action == Action::"checkout",
    resource == CheckoutCounter::"12"
);

This statement allows customer “John” to checkout at checkout counter 12.

When condition clause

Policy:

permit(
    principal,
    action == Action::"connectDatabase",
    resource == Database::"db1"
) when {
    context.port == 5432
};

Context:

{
    "port": 5432
}

This statement allows any principal to connect to database “db1”, so long as the “port” attribute in their request context is 5432.

Unless condition clause

Policy:

permit(
    principal,
    action in [HTTPMethod::Action::"GET", HTTPMethod::Action::"POST", HTTPMethod::Action::"DELETE"],
    resource
) unless {
    [Viewer::"anonymous", Viewer::"unknown"].contains(principal) ||
    context.waf_risk_rating >= 7
};

Context:

{
    "waf_risk_rating": 8.5
}

This statement allows any principal to perform a HTTP GET, POST or DELETE against any resource unless they are identified as an anonymous or unknown viewer or their WAF risk rating is greater than or equal to 7.

IP and decimal usage

Policy:

permit(
    principal,
    action == HTTPMethod::Action::"GET",
    resource
) when {
    (
        // local subnet or same machine
        ip(context.http_request.client_ip).isInRange(ip("10.0.0.0/8")) ||
        ip(context.http_request.client_ip).isLoopback()
    ) &&
    decimal(context.risk_score).lessThan(decimal("6.5"))
};

Context:

{
    "http_request": {
        "client_ip": "10.0.1.54"
    },
    "risk_score": "4.7"
}

This statement allows any principal to perform a HTTP GET against any resource when their IP address is within the 10.0.0.0/8 or loopback CIDR range and the value of the string-encoded risk score is less than 6.5.

Entity attributes

Policy:

permit(
    principal,
    action == SecuritySystem::Action::"swipeCardAccess",
    resource == Room::"Sydney Boardroom"
) when {
    principal.location like "Sydney*" ||
    principal.training.contains("All Access")
};

Entities:

[
    {
        "uid": "Employee::\"1453\"",
        "attrs": {
            "location": "Sydney East",
            "training": [
                "General"
            ]
        }
    },
    {
        "uid": "Employee::\"325\"",
        "attrs": {
            "location": "Los Angeles",
            "training": [
                "General",
                "All Access"
            ]
        }
    }
]

This statement allows any principal to swipe card access to the Sydney Boardroom if their location attribute starts with “Sydney” or their training attribute contains the “All Access” item. Both employees 1453 and 325 would be permitted under this statement.

Entity attributes relationship

Policy:

permit(
    principal,
    action == HTTP::Action::"GET",
    resource
) when {
    resource.owner == principal.username
};

Entities:

[
    {
        "uid": "User::\"Josh\"",
        "attrs": {
            "username": "josh1"
        }
    },
    {
        "uid": "File::\"blogpost.txt\"",
        "attrs": {
            "owner": "josh1"
        }
    }
]

This statement allows any principal to HTTP GET a file which they have ownership of. The entity User::"Josh" would be permitted to perform a HTTP::Action::"GET" on the File::"blogpost.txt" entity.

Entity inheritance

Policy:

forbid(
    principal,
    action,
    resource == Application::"oracle"
) unless {
    principal in Group::"Admins"
};

Entities:

[
    {
        "uid": "User::\"Ian\"",
        "parents": [
            "Group::\"Admins\"",
            "Group::\"Users\""
        ]
    }
]

This statement forbids any principal to perform any action against the oracle application unless they are a part of the Admins group. The entity User::"Ian" would be exempt from this forbid statement.

Policy template

Policy Template:

permit(
    principal == ?principal,
    action == Action::"Connect",
    resource == ?resource
);

Policy Variables:

principal: User::"Harry"
resource: VPN::"vpn1"

The policy created from the policy template allows the user Harry to connect to the VPN “vpn1”.

Wrapping up

The Cedar language is both excitingly new and comfortingly familiar. It opens a new world of possible use cases and, of course, a new set of challenges and considerations. I look forward to seeing how the language gets used in real world scenarios and the ways people will architect their applications around the services Cedar supports.

A big thank you to members from the identity and automated reasoning teams for helping answer some questions I had during the creation of this post. If you liked what I’ve written, or want to hear more on this topic, reach out to me on Twitter at @iann0036.

Patching the AWS JavaScript SDK for Service Workers

2022-01-11T00:00:00+00:00

The AWS JavaScript SDK supports Node.js, React Native and web browsers, but what if you’re running in a service worker? In this post, I’ll explain how I modified version 2 of the AWS JavaScript SDK to run within a service worker context.

Background

For the Former2 project, I produce browser extensions for most major browsers in order to bypass the lack of CORS for the majority of AWS services. This means that I embed a copy of the AWS JavaScript SDK in order to make the calls needed via the browser extension, which has authority to ignore the lack of CORS.

The browser extensions use a “manifest”, which details the functionality of the extension and what actions are permitted. Google is sunsetting version 2 of the manifest for Google Chrome and requires all extensions to move to manifest version 3 by the end of 2022. Along with some structural differences, one of the major changes required is to move from background pages (logic that runs in the background of an extension) to service workers.

Service workers (which are a subset of JavaScript workers) have greater limitations than background pages, including the lack of access to the DOM and its features, as well as the replacement of XMLHttpRequest for fetch. Service workers will also move to an inactive state if unused in a short period of time, meaning initialized variable data isn’t persisted, though I’ve skipped talking about my specific remediations to this in this article (hint: use IndexedDB).

The Challenge

Version 3 of the AWS JavaScript SDK is written in a way that it’s supported in a service worker context, but version 2 does not due to a variety of reasons. If you’re already using version 3 of the SDK, or are starting development on a service worker from scratch using version 3, you won’t have a problem.

As the Former2 project heavily relies on the syntax of version 2 of the SDK, as well as the fact that the service calls a majority of available services in the SDK, I wanted to avoid a migration effort to version 3 of the SDK. Others with existing projects making heavy use of SDK version 2 that are seeking to move to service workers (or CloudFlare Workers) might also benefit from this.

Note that this is not an official change, and these changes could break current or future functionality in unintended ways, so I don’t recommend you use this in a production context.

Attempting to import

After performing the changes to the browser extension manifest, my first issue was that the SDK script could no longer be directly loaded into the shared DOM model.

Before:

"background":  {
  "scripts": [
    "aws-sdk-2.1046.0.js",
    "bg.js"
  ]
},

After:

"background":  {
  "service_worker": "bg.js"
},

Service workers come with a way to load scripts using the importScripts() function. So I added the following to the top of my bg.js script:

importScripts("aws-sdk-2.1046.0.js");

This addition now silently failed the AWS calls I requested the extension make, without much debugging information.

It’s at this point that I’d like to call out Saurav Kushwaha for his prior work in this area, which overrides the XHRClient class used in the AWS namespace with fetch. I did need to perform a couple of slight modifications to properly return correct error codes however.

After replacing the XHRClient class, I was happy to see that some calls were successfully returning, but for some reason there was still some failures.

XML is hard

The failures I was seeing were coming from STS and S3, and I quickly realised that these were APIs that returned XML-based responses.

One immediate problem that actually showed error logs was that window was not defined, where parts of the SDK expected it to be available.

I quickly added a one-liner to make that available during initialisation:

if(!window){var window = {}};

After that change, I was now receiving an error that it could not load the XML parser.

Digging into the SDK, the logic looked like the following:

if (window.DOMParser) {
  // use the native DOM parser library
} else if (window.ActiveXObject) {
  // use the ActiveXObject to parse, a fallback for IE8 and lower
} else {
  throw new Error("Cannot load XML parser");
}

The SDK relies on the native DOM parser to interpret XML responses from those services, so in order to alleviate this I decided to find a polyfill to replace it. I came across xmldom module on npm and found it suitable for my needs. I did need to bundle this into a browser-compatible library, so used browserify to achieve this.

After importing the new DOM parser library for use by the SDK, I re-tested the calls which produced a valid response end-to-end. All done, or so I thought.

Something strange

Though my application now seemed to be working well, producing no errors and always returning valid responses, I noticed that many of my list calls (for example, S3.ListBucket) weren’t returning the resources within my account I expected.

I suspected some issues with the XML parser and dumped both the response of the HTTP call, and the object immediately after xmldom had parsed it. Both of these correctly showed the bucket names I was expecting, yet the response produced an empty array.

This one hurt my head. After debugging for probably a few hours, I found the issue. During the process of constructing the response in a clean format, the SDK requests the properties Element.firstElementChild and Element.nextElementSibling from the parsed object, however xmldom had not yet implemented these properties and so the iterators were silently failing.

After having a look at the xmldom library to investigate whether it could be easily patched, I instead simply implemented these properties as methods directly and replaced the SDK code which accesses these properties with my implementation, as shown below:

function getFirstElementChild(xml) {
  for (var i = 0; i < xml.childNodes.length; i++) {
    if (xml.childNodes[i].hasOwnProperty('tagName')) {
      return xml.childNodes[i];
    }
  }
  return null;
}

function getNextElementSibling(xml) {
  var foundSelf = false;
  for (var i = 0; i < xml.parentNode.childNodes.length; i++) {
    if (xml.parentNode.childNodes[i] === xml) {
      foundSelf = true;
      continue;
    }
    if (foundSelf && xml.parentNode.childNodes[i].hasOwnProperty('tagName')) {
      return xml.parentNode.childNodes[i];
    }
  }
  return null;
}

Wrapping up

After all the above changes were made, I was able to produce a version of the version 2 SDK which, from all the tests I’ve made, seems to work as intended within a service worker context.

I’ve made a version of the service worker-compatible SDK available on GitHub, should you want to compile your own. Refer to the official docs for specific compilation options, as they should work the same.

I got pretty close to abandoning this experiment, but I’m glad I persisted. I learned a lot about the internals of the SDK and got a working alternative in the end. If you liked what I’ve written, or want to tell me how terrible of an idea this was, reach out to me on Twitter at @iann0036.