title: Mercator workshop
titleTemplate: %s
theme: seriph
highlighter: shiki
lineNumbers: false
info: |
  ## Mercator workshop
  Presentation slides for [Mercator](https://github.com/DNSBelgium/Mercator). A hands-on workshop to
  quickly set up the required infrastructure on AWS.
background: /cover1.jpeg
hideInToc: true
DNS Belgium's crawler
- Registry for .be, .brussels and .vlaanderen
- 1.750.000+ domain names
Mercator was built with several goals in mind
- 💪 Robustness: Components should gracefully handle the unexpected
- 🐘 Scalability: Crawl .be zone in 24 hours
- 🤸 Extensibility: New components should not disrupt the working of other components
- 🔍 Observability: Be able to quickly spot
- 🚴 performance issues
- 💥 functional issues
- Event driven
- Dispatcher
- Reads work from input queue
- Dispatches work to all modules (no pub-sub)
- Each crawler module
- Can be scaled independently from other modules
- Reads work from its message queue
- 1 message = 1 domain name = 1 transaction
- Stores its results in its own DB schema and/or S3 bucket
- Sends an ACK to its output queue
- Separates fetching from processing
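The fan-out described above can be sketched in a few lines of shell. This is only a toy model, not Mercator's implementation: plain files stand in for the SQS queues, the module names are hypothetical, and no crawling takes place. It shows that one message per domain name is copied to every module, and that every module returns one ACK per message.

```shell
# Hypothetical module names, for illustration only.
modules="dns smtp content"

# A temporary directory stands in for the message broker; each file is one queue.
broker=$(mktemp -d)

# dispatch DOMAIN: copy one message to every module queue (fan-out, no pub-sub).
dispatch() {
  for module in $modules; do
    printf '%s\n' "$1" >> "$broker/$module.in"
  done
}

# run_module MODULE: handle each queued message and write one ACK per domain.
run_module() {
  while IFS= read -r domain; do
    # A real module would crawl here and store results in its own schema or bucket.
    printf '%s %s\n' "$1" "$domain" >> "$broker/acks"
  done < "$broker/$1.in"
}

dispatch dnsbelgium.be
dispatch hln.be
for module in $modules; do
  run_module "$module"
done

# Count the ACKs, then clean up the temporary queues.
ack_count=$(wc -l < "$broker/acks" | tr -d ' ')
rm -rf "$broker"
echo "ACKs received: $ack_count"
```

With three modules and two domain names, six ACKs are expected. Because each module reads only its own queue, modules can be added or scaled without touching the others.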
::right::
- Use headless chrome to take screenshots + fetch HTML
- Fetches DNS records
- Geo IP on A & AAAA
- Detects over 950 web technologies using Wappalyzer
- Talks SMTP
- VAT crawler (follows links until VAT found or max depth)
- Extract HTML features (#social media links, …)
- Language detection
- REST API + basic UI
::right::
- Amazon SQS => automatic retries + DLQ
- Spring Boot (Java) + NodeJS (TypeScript) + Python + Scikit-learn
- PostgreSQL
- Raw HTML and screenshots on S3
- Infra managed with Terraform
- Deployed on Kubernetes using Helm => self-healing
- Continuous Delivery using Jenkins pipelines
- Local development using docker-compose
- Grafana & Prometheus
::right::
layout: cover
dim: false
background: https://images.unsplash.com/photo-1499951360447-b19be8fe80f5?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=3270&q=80
In this workshop, we use AWS Cloud9, an IDE in the cloud. Cloud9 offers a Linux environment with some tools pre-installed, such as Java and Docker.
To open your Cloud9 environment, first log in to AWS using the sign-in URL you received by email. Then open your Cloud9 environment with the Cloud9 link in the same email.
We are going to use some extra tools during this workshop:
- postgresql client to connect to the DB
- jq and yq to parse JSON on the command line
sudo yum -y install postgresql
sudo wget https://github.com/mikefarah/yq/releases/download/v4.22.1/yq_linux_amd64 -O /usr/bin/yq && \
sudo chmod +x /usr/bin/yq
sudo wget https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64 -O /usr/bin/jq && \
sudo chmod +x /usr/bin/jq
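A quick check that the tools ended up on the PATH can save time later. The sketch below assumes the postgresql package provides the psql command (it is used later in this workshop); missing_tools is a hypothetical helper name, not part of any of the tools.

```shell
# missing_tools TOOL...: print every tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# psql is assumed to be the client installed by the postgresql package.
missing=$(missing_tools psql jq yq)
if [ -n "$missing" ]; then
  echo "Not on PATH yet:" $missing
else
  echo "psql, jq and yq are available"
fi
```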
We also need to configure the AWS credentials in order to build the infrastructure.
You should have received the necessary credentials by email, in a form like:
export DNS_AWS_ACCESS_KEY=<aws_key>
export DNS_AWS_SECRET_KEY=<aws_secret>
export DNS_MAXMIND_KEY=<maxmind_key>
You can copy and paste that block into the Cloud9 console.
It is best to add the variables to your bash profile as well, so that new terminal tabs pick them up:
echo "export DNS_AWS_ACCESS_KEY=$DNS_AWS_ACCESS_KEY" >>~/.bash_profile
echo "export DNS_AWS_SECRET_KEY=$DNS_AWS_SECRET_KEY" >>~/.bash_profile
echo "export DNS_MAXMIND_KEY=$DNS_MAXMIND_KEY" >>~/.bash_profile
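Before moving on, it may help to confirm that all three variables are actually set in the current shell. The following is a minimal sketch; unset_vars is a hypothetical helper introduced here for illustration.

```shell
# unset_vars NAME...: print each variable name that is unset or empty.
unset_vars() {
  for name in "$@"; do
    # eval reads the variable whose name is stored in $name.
    # Only pass trusted, fixed variable names to this helper.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

unset_names=$(unset_vars DNS_AWS_ACCESS_KEY DNS_AWS_SECRET_KEY DNS_MAXMIND_KEY)
if [ -n "$unset_names" ]; then
  echo "Still unset:" $unset_names
else
  echo "All workshop variables are set"
fi
```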
Once the environment variables are set up, we can use the following script to request temporary credentials for AWS. The script relies on Boto3, the AWS SDK for Python.
git clone https://github.com/DNSBelgium/mercator-workshop-centr.git
sudo pip install boto3
$(python mercator-workshop-centr/aws_assume_role.py export) # no output if successful
Finally, we need more disk space. The following script resizes the disk attached to your Cloud9 environment.
mercator-workshop-centr/resize_ebs.sh 50
You will need several things to build Mercator:
- Java 11+
- Docker
- Docker-compose (for running Mercator locally)
- Helm 3
If you want to run Mercator, you also need a MaxMind license. Create an account and generate a license key.
Helm is a package manager for Kubernetes. It simplifies the management of Kubernetes resources.
Gradle uses Helm to build Mercator's Helm charts.
To install Helm:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
helm version # version.BuildInfo{Version:"v3.8.1", ...}
We will go deeper into Helm once we deploy Mercator to Kubernetes.
Clone the git repo and use Gradle to build Mercator
git clone https://github.com/DNSBelgium/mercator.git
cd mercator
./gradlew build -x test # 5 min
Gradle compiles the subprojects. It can also create the Docker images and the Helm charts, as we will see in the next slides.
Docker Compose is a tool for defining and running multi-container Docker applications. A YAML file defines the application's services. With a single command, you create and start all the services from your configuration.
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" \
-o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version # docker-compose version 1.29.2, build 5becea4c
./gradlew dockerBuild # Create local docker images (7m)
cat > .env <<EOF
MAXMIND_LICENSE_KEY=$DNS_MAXMIND_KEY
EOF
docker-compose up -d
aws sqs --endpoint-url http://localhost:4566 send-message --queue-url $(aws sqs --endpoint-url \
http://localhost:4566 get-queue-url --queue-name mercator-dispatcher-input | jq -r .QueueUrl) \
--message-body '{"domainName": "dnsbelgium.be"}'
docker-compose logs dns-crawler
PGPASSWORD=password psql -h localhost -U postgres postgres -f usefulequeries/count_names_crawled.sql
docker-compose down
- hln.be
- vrt.be
- youtu.be
- google.be
- telenet.be
- rtbf.be
- sudinfo.be
- belgium.be
- 2dehands.be
- proximus.be
- zalanda.be
::right::
- ns5.be
- openprovider.be
- groupon.be
- yt.be
- adidas.be
- stepstone.be
- irisnet.be
- aviation24.be
- bnpparibasfortis.be
- bpost2.be
- nbb.be
- hubo.be
- acerta.be
- denk-it.be
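Assuming the domain names above are meant as crawl candidates, the message bodies can be generated in a loop rather than typed one by one. The sketch below only prints the JSON bodies; message_body is a hypothetical helper, and it does no JSON escaping, which is acceptable for plain domain names.

```shell
# message_body DOMAIN: print the JSON body used for one crawl request.
message_body() {
  printf '{"domainName": "%s"}\n' "$1"
}

# A few names from the lists above; extend as needed.
for domain in hln.be vrt.be nbb.be; do
  message_body "$domain"
done
```

Each printed line can be passed as the --message-body argument of the aws sqs send-message command shown earlier.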
layout: cover
background: https://sli.dev/demo-cover.png
Amazon Web Services offers reliable, scalable, and inexpensive cloud computing services, including:
- Simple Queue Service (SQS), a distributed message queuing service that lets you decouple and scale microservices and distributed systems.
- Elastic Kubernetes Service (EKS), a managed Kubernetes service.
- Simple Storage Service (S3), an object storage service.
Terraform is an open-source infrastructure-as-code software tool that provides a consistent CLI workflow to manage hundreds of cloud services. Terraform codifies cloud APIs into declarative configuration files.
Create the basic infrastructure needed for Mercator (25m).
This creates the network and the Kubernetes cluster.
cd ~/environment/
git clone https://github.com/DNSBelgium/mercator-infra-tf-aws.git
cd mercator-infra-tf-aws
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
terraform init -backend-config="bucket=mercator-workshop-setup-terraform-${AWS_ACCOUNT_ID}"
terraform apply --auto-approve
Terraform creates the following resources:
- VPC and subnets
- VPC endpoints for accessing AWS services (SQS, ECR, ...)
- SQS queues
- S3 buckets
- ECR repositories
- An IAM role for Mercator
- A PostgreSQL database
- A Kubernetes cluster
Kubernetes can only pull Docker images from a registry it can reach. At DNS Belgium, we use AWS ECR to host Docker images. We do not yet publish Docker images for Mercator publicly.
Terraform created the required ECR repositories. We can now push the previously built Docker images with Gradle.
cd ~/environment/mercator
aws ecr get-login-password | docker login -u AWS --password-stdin \
${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
./gradlew dockerBuildAndPush -PdockerRegistry=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/ \
-PdockerTags=workshop
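The registry hostname in the commands above is composed from the account ID and the region, so an empty variable yields a broken hostname. As a hedged sketch, a small helper (ecr_registry is a hypothetical name) can build the hostname and fail early when a value is missing. The account ID and region below are placeholders.

```shell
# ecr_registry ACCOUNT_ID REGION: print the ECR registry hostname for an account.
ecr_registry() {
  if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
    echo "usage: ecr_registry ACCOUNT_ID REGION" >&2
    return 1
  fi
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# Placeholder values, not a real account.
ecr_registry 123456789012 eu-west-1
```

In the workshop environment, the same helper could be called with "$AWS_ACCOUNT_ID" and "$AWS_REGION" to verify both are set before logging in to the registry.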
In order to connect to Kubernetes, we need some extra tools and setup:
curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.23.4/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
kubectl version --client # Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", ...}
curl -LO https://github.com/derailed/k9s/releases/download/v0.25.18/k9s_Linux_x86_64.tar.gz
tar xvf k9s_Linux_x86_64.tar.gz
chmod +x k9s
sudo mv ./k9s /usr/local/bin/k9s
k9s version
aws eks update-kubeconfig --name mercator
kubectl get nodes
In Mercator, each component has its own Helm chart. To install them all at once, we have created an umbrella chart that depends on the chart of every component.
We first need to pass the parameters from Terraform to Helm. The following script generates a values.yaml file with all the parameters for your account.
./generate_helm_values.sh
We can then use Helm to install Mercator on Kubernetes:
cd ~/environment/mercator
helm dependency build mercator-helm-umbrella
helm install -f ~/environment/mercator-infra-tf-aws/values.yaml mercator mercator-helm-umbrella
You can then see the pods with kubectl (or k9s).
kubectl get deployments
# or
k9s # (Ctrl-C to exit)
Send a message to the crawler:
aws sqs send-message --queue-url $(aws sqs get-queue-url --queue-name mercator-dispatcher-input \
| jq -r .QueueUrl) --message-body '{"domainName": "dnsbelgium.be"}'
Access the Mercator UI:
kubectl port-forward svc/mercator-mercator-ui 8080:80