A Serverless Solution to Keeping Git Repositories Synchronized
Synchronizing Git repositories is a fairly common requirement. This article describes how we built a solution that replicates an AWS CodeCommit repo to any other Git repo in real time, using just a Lambda function.
Why Sync?
There are many scenarios in which you might want to sync two or more Git repositories, preferably in real time.
Backup
Keeping backups of code is probably the most common reason for mirroring a Git repo elsewhere. This includes use cases like automatically replicating your CodeCommit repositories across AWS regions or copying code from CodeCommit to GitHub/GitLab.
Although the Lambda-based solution described in this article can be used for backups, it’s overkill for this use case, for two reasons:
- Most backups of Git repositories aren’t required to be other Git repositories. They’re usually just a ZIP or tarball of the repo uploaded to long-term storage like S3.
- Backups also need not be “real-time.” Most people are happy with scheduled backup solutions that run, say, every 6 hours or so, depending on how heavily your source repo is used.
Integrating Disparate Systems
Often, your code resides in a system that won’t play well with other tools in your development workflow. For example, your CI/CD provider might not support integration with your Git provider. In that case, it makes sense to keep a supported Git provider in sync with your existing Git repo using an automated solution. With both code repositories in sync, you can achieve a fully-automated real-time CI/CD experience.
Our Solution
Our solution focuses specifically on taking code from a CodeCommit repo & mirroring it to any other Git repo, be it another CodeCommit repo in another AWS region or account, or outside of AWS to GitHub, GitLab, etc.
Although there already exist solutions to this challenge, like the one described in the AWS blog Replicating and Automating Sync-Ups for a Repository with AWS CodeCommit, we wanted a low-maintenance & FREE solution that could be reused/redeployed/replicated for every pair of repositories we needed to sync. Serverless & Lambda were the obvious answer!
SAM App
We started by creating an AWS SAM application that would eventually grow into the one-click solution we needed. The end result of the deployed app is as shown below:
The heart of the solution is the Lambda function. Since using the Git CLI is the easiest way to clone & push repositories, we wanted our Lambda function to run a Shell script, instead of the conventional Python or Node.js code.
Running Shell Scripts in Lambda
Although running Bash scripts in a Lambda function is easily doable as described in Run Bash Scripts in AWS Lambda Functions, running Git is a whole new ball game! Git doesn’t come preinstalled in the base Lambda runtime & installing it is rather cumbersome. It’s easier to take charge of the Lambda container itself & install everything we need in it. That’s how we arrived at using an Amazon Linux 2 container for our Lambda function.
Dockerfile
Start by creating a Dockerfile to build the Lambda container:
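# Base image for custom Lambda runtimes, pinned to the al2 tag to match the Amazon Linux 2 runtime used here
FROM public.ecr.aws/lambda/provided:al2
# Install jq and Git, then clear the yum cache to keep the image small
RUN yum update -y && yum install -y jq git && yum clean all
# bootstrap is the custom runtime entry point; function.sh holds the handler
COPY bootstrap ${LAMBDA_RUNTIME_DIR}
COPY function.sh ${LAMBDA_TASK_ROOT}
CMD [ "function.handler" ]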
The public.ecr.aws/lambda/provided base image is the Amazon Linux 2 runtime. The next line installs jq along with Git. The use of jq is described later in this article.
Lambda Bootstrap
bootstrap is an executable Bash script that Lambda runs as the entry point of the custom runtime. It polls the Lambda Runtime API for events & passes each one to the handler:
#!/bin/bash
set -euo pipefail
# Initialization - load function handler
source "$LAMBDA_TASK_ROOT"/"$(echo "$_HANDLER" | cut -d. -f1).sh"
# Processing
while true
do
  HEADERS="$(mktemp)"
  # Get an event. The HTTP request will block until one is received
  EVENT_DATA=$(curl -sS -LD "$HEADERS" -X GET "http://${AWS_LAMBDA_RUNTIME_API}/2018-06-01/runtime/invocation/next")
  # Extract request ID by scraping response headers received above
  REQUEST_ID=$(grep -Fi Lambda-Runtime-Aws-Request-Id "$HEADERS" | tr -d '[:space:]' | cut -d: -f2)
  # Run the handler function from the script
  RESPONSE=$($(echo "$_HANDLER" | cut -d. -f2) "$EVENT_DATA")
  # Send the response
  curl -X POST "http://${AWS_LAMBDA_RUNTIME_API}/2018-06-01/runtime/invocation/$REQUEST_ID/response" -d "$RESPONSE"
done
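To see what the header-scraping step does in isolation, the sketch below runs the same grep, tr & cut pipeline against a hand-written response. The helper name extract_request_id and the sample header values are made up for illustration; in the real function, the headers come from the Runtime API.

```shell
#!/bin/bash
# Hypothetical helper: pull the request ID out of a saved header dump,
# using the same grep | tr | cut pipeline as bootstrap.
extract_request_id() {
  grep -Fi Lambda-Runtime-Aws-Request-Id "$1" | tr -d '[:space:]' | cut -d: -f2
}

# Sample headers (illustrative values), with CRLF line endings as in HTTP.
SAMPLE_HEADERS="$(mktemp)"
printf 'HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nLambda-Runtime-Aws-Request-Id: 8476a536-e9f4-11e8-9739-2dfe598c3fcd\r\nLambda-Runtime-Deadline-Ms: 1542409706888\r\n\r\n' > "$SAMPLE_HEADERS"

extract_request_id "$SAMPLE_HEADERS"
# prints 8476a536-e9f4-11e8-9739-2dfe598c3fcd
```

Because tr strips all whitespace, the carriage return at the end of the header line is removed along with the space after the colon, so the ID can be dropped straight into the response URL.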
Function Handler
When the Lambda function is invoked, the handler function in function.sh is called:
#!/bin/bash
export HOME=/tmp # so Git can write .gitconfig here
CLONE_DIR=/tmp/src-repo
# URL encode
SRC_USER=$(echo -n "$SRC_USER" | jq -sRr @uri)
SRC_PASS=$(echo -n "$SRC_PASS" | jq -sRr @uri)
DEST_USER=$(echo -n "$DEST_USER" | jq -sRr @uri)
DEST_PASS=$(echo -n "$DEST_PASS" | jq -sRr @uri)
SRC_REPO=${SRC_REPO/'https://'/"https://$SRC_USER:$SRC_PASS@"}
DEST_REPO=${DEST_REPO/'https://'/"https://$DEST_USER:$DEST_PASS@"}
function handler() {
  rm -rf "$CLONE_DIR"
  git clone --mirror "$SRC_REPO" "$CLONE_DIR"
  cd "$CLONE_DIR"
  git remote add dest "$DEST_REPO"
  git push dest --mirror
  echo 'DONE! Successfully mirrored source repo to destination!'
}
As seen above, the handler simply clones the CodeCommit repo & pushes it to the destination repo. Notice how jq is being used above the handler to URL encode the source & destination credentials. That’s because they’ll be embedded in the Git repo URLs to avoid Git prompting for credentials.
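The effect of the encoding step can be illustrated without jq. The sketch below is a pure-Bash stand-in for jq’s @uri filter, not what function.sh actually uses, and it handles ASCII input only. The function names urlencode & embed_credentials and the sample credentials are hypothetical.

```shell
#!/bin/bash
# Percent-encode every character except the URL-unreserved set
# (letters, digits, '-', '.', '_', '~'). ASCII input only.
urlencode() {
  local LC_ALL=C s="$1" out="" c hex i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [A-Za-z0-9._~-]) out+="$c" ;;
      *) printf -v hex '%%%02X' "'$c"; out+="$hex" ;;
    esac
  done
  printf '%s\n' "$out"
}

# Insert encoded credentials right after the https:// scheme of a repo URL.
embed_credentials() {
  local url="$1" user pass
  user="$(urlencode "$2")"
  pass="$(urlencode "$3")"
  printf 'https://%s:%s@%s\n' "$user" "$pass" "${url#https://}"
}

embed_credentials 'https://github.com/username/destination-repo.git' 'octocat' 'p@ss:w/rd'
# prints https://octocat:p%40ss%3Aw%2Frd@github.com/username/destination-repo.git
```

Without the encoding, the @, : & / characters in the sample password would be read as URL delimiters & Git would end up contacting the wrong host or path.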
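The --mirror option in git push above specifies that all refs under refs/ (refs/heads/, refs/remotes/, refs/tags/, etc.) will be mirrored to the remote repository. Refs that were updated locally are force-updated on the remote, & refs that were deleted locally are removed from it.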
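The mirroring behavior can be tried out locally before pointing the function at real repositories. The following is a minimal sketch that assumes only that Git is installed: throwaway bare repositories in a temporary directory stand in for the CodeCommit source & the destination, and the branch & tag names are invented for the demo.

```shell
#!/bin/bash
set -euo pipefail

# Throwaway local repos stand in for the real source & destination (assumption).
WORK="$(mktemp -d)"
git init --quiet --bare "$WORK/source.git"
git init --quiet --bare "$WORK/dest.git"

# Seed the source with one commit on two branches, plus a tag.
git init --quiet "$WORK/seed"
git -C "$WORK/seed" -c user.name=demo -c user.email=demo@example.com \
  commit --quiet --allow-empty -m 'initial commit'
git -C "$WORK/seed" tag v1
git -C "$WORK/seed" push --quiet "$WORK/source.git" \
  HEAD:refs/heads/main HEAD:refs/heads/feature refs/tags/v1

# Same steps as the handler: mirror-clone the source, then mirror-push to dest.
git clone --quiet --mirror "$WORK/source.git" "$WORK/mirror.git"
git -C "$WORK/mirror.git" remote add dest "$WORK/dest.git"
git -C "$WORK/mirror.git" push --quiet dest --mirror

# List every ref that arrived in the destination.
DEST_REFS="$(git -C "$WORK/dest.git" for-each-ref --format='%(refname)')"
echo "$DEST_REFS"
```

Both branches & the tag should show up in the destination listing, even though no individual ref was named in the final push.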
That covers everything about the Lambda function itself. Let us now look at the SAM template that brings all the pieces together.
SAM Template
The SAM template contains just 1 resource, the Lambda function:
Resources:
  AwsCodeCommitSync:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: aws-codecommit-sync
      PackageType: Image
      Timeout: 900
      ReservedConcurrentExecutions: 1
    Metadata:
      Dockerfile: Dockerfile
      DockerContext: .
As seen above, it’s best to limit the number of concurrent executions of this Lambda function to just 1. We don’t want multiple instances getting triggered in parallel in case a lot of CodeCommit events are captured in a short period of time.
EventBridge Rule
The Events property of the Lambda function creates the EventBridge rule that watches for CodeCommit events & triggers this Lambda:
Events:
  AllCodeCommitEvents:
    Type: EventBridgeRule
    Properties:
      Pattern:
        source:
          - aws.codecommit
        account:
          - !Ref AWS::AccountId
        region:
          - !Ref AWS::Region
        resources:
          - !Sub arn:aws:codecommit:${AWS::Region}:${AWS::AccountId}:${SourceCodeCommitRepoName}
        detail:
          repositoryName:
            - !Ref SourceCodeCommitRepoName
Template Parameters
The template expects the following parameters:
SOURCE
- The name of the source CodeCommit repo, like source-repo.
- The HTTPS Git clone URL of the source CodeCommit repo, like https://git-codecommit.ap-south-1.amazonaws.com/v1/repos/source-repo.
- The Git username used to clone the source CodeCommit repo, like iam-user-at-123456789012.
- The Git password used to clone the source CodeCommit repo.
DESTINATION
- The HTTPS Git push URL of the destination repo, like https://github.com/username/destination-repo.git.
- The Git username used to push to the destination repo, like your GitHub username.
- The Git password used to push to the destination repo. If using GitHub, create a personal access token with the repo scope & use it here.
All these parameters become environment variables to the Lambda function, which uses them to clone & push to the repositories.
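Since the function depends entirely on these environment variables, a small guard can make a missing parameter obvious. This is a sketch rather than part of the published app: require_env is a hypothetical helper, and the variable names are the ones function.sh reads.

```shell
#!/bin/bash
# Hypothetical guard: report every named variable that is unset or empty.
# Returns 0 when all are present, 1 otherwise.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "Missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Variable names as read by function.sh.
if require_env SRC_REPO SRC_USER SRC_PASS DEST_REPO DEST_USER DEST_PASS; then
  echo 'All sync parameters are set.'
else
  echo 'Sync parameters are incomplete.'
fi
```

A call like this could sit at the top of function.sh, so that a misconfigured deployment fails with a clear message in the logs instead of a Git authentication error.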
Ready-to-Use App
This entire app is available on GitHub at https://github.com/harishkm7/aws-codecommit-sync. Just clone it to your system & follow the README!
Note: Do not manually push any changes to the destination repository. The Lambda function pushes with --mirror, which force-updates the destination’s refs to match the source, so commits pushed directly to the destination will be overwritten, or the mirror push may be rejected if the destination blocks force pushes. Treat the destination as a read-only repository, and push all your development changes to your source repository only.
About the Author
Harish KM is an AWS Developer at QloudX. He is passionate about creating zero-maintenance fully-serverless cloud-native solutions in AWS. With 20+ cloud & IT certifications, he is an expert in a multitude of technologies, especially serverless.