title: Mercator workshop
titleTemplate: %s
favicon:
theme: seriph
highlighter: shiki
lineNumbers: false
info: |
  ## Mercator workshop
  Presentation slides for [Mercator](https://github.com/DNSBelgium/Mercator).
  This is a hands-on workshop to quickly set up the needed infrastructure on AWS.
drawings:
  persist:
fonts:
  sans: Roboto
  serif: Roboto Slab
  mono: Fira Code
background: /cover1.jpeg
hideInToc: true

Mercator

DNS Belgium's crawler

Press Space for next page

hideInToc: true

Table of contents


layout: intro hideInToc: true

DNS Belgium

  • Registry for .be, .brussels and .vlaanderen
  • 1.750.000+ domain names

layout: intro

Mercator's design goals

Mercator was built with several goals in mind

  • 💪 Robustness: Components should gracefully handle the unexpected
  • 🐘 Scalability: Crawl .be zone in 24 hours
  • 🤸 Extensibility: New components should not disrupt the working of other components
  • 🔍 Observability: Be able to quickly spot
    • 🚴 performance issues
    • 💥 functional issues

layout: two-cols

Architecture

  • Event driven
  • Dispatcher
    • Reads work from input queue
    • Dispatches work to all modules (no pub-sub)
  • Each crawler module
    • Can be scaled independently from other modules
    • Reads work from its message queue
    • 1 message = 1 domain name = 1 transaction
    • Stores its results in its own DB schema and/or S3 bucket
    • Sends an ACK to its output queue
    • Separates fetching from processing

::right::

Mercator Architecture


layout: two-cols

Functionality

  • Use headless chrome to take screenshots + fetch HTML
  • Fetches DNS records
  • Geo IP on A & AAAA
  • Detects over 950 web technologies using Wappalyzer
  • Talks SMTP
  • VAT crawler (follows links until VAT found or max depth)
  • Extract HTML features (#social media links, …)
  • Language detection
  • REST API + basic UI

::right::

Mercator Architecture


layout: two-cols

Technology

  • Amazon SQS => automatic retries + DLQ
  • Spring Boot (Java) + NodeJS (TypeScript) + Python + Scikit-learn
  • PostgreSQL
  • Raw HTML and screenshots on S3
  • Infra managed with Terraform
  • Deployed on Kubernetes using Helm => self-healing
  • Continuous Delivery using Jenkins pipelines
  • Local development using docker-compose
  • Grafana & Prometheus

::right::

Mercator Architecture


Dashboarding

Crawl rates


Scaling on Kubernetes

Crawl rates


Pipelines


Environment setup

<style> h1 { @apply absolute -bottom-2 left-5 text-amber-500; } </style>

Cloud9

In this workshop, we use AWS Cloud9, an IDE in the cloud. Cloud9 offers a Linux environment with some tools pre-installed, such as Java and Docker.

To open your Cloud9 environment, first log in to AWS using the sign-in URL you received by email. Then open the environment with the Cloud9 link in the same email.

We are going to use some extra tools during this workshop:

  • PostgreSQL client (psql) to connect to the DB
  • jq and yq to parse JSON and YAML on the command line
sudo yum -y install postgresql
sudo wget https://github.com/mikefarah/yq/releases/download/v4.22.1/yq_linux_amd64 -O /usr/bin/yq && \
  sudo chmod +x /usr/bin/yq
sudo wget https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64 -O /usr/bin/jq && \
  sudo chmod +x /usr/bin/jq
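
Before moving on, it can help to confirm that the tools ended up on the PATH. The following is a minimal sketch; the helper name `missing_tools` is made up for this example and is not part of the workshop repository.

```shell
# Print the name of every command from the argument list that is not on PATH.
# Prints nothing when all of them are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the tools installed on this slide.
missing_tools psql jq yq
```

An empty result means all three commands were found; any name that is printed still needs to be installed.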

Cloud9

We also need to configure the AWS credentials in order to build the infrastructure.

You should have received the necessary credentials by email, in a block like this:

export DNS_AWS_ACCESS_KEY=<aws_key>
export DNS_AWS_SECRET_KEY=<aws_secret>
export DNS_MAXMIND_KEY=<maxmind_key>

You can copy and paste that block into the Cloud9 terminal.

It is best to also add the variables to your bash profile, so that they are available when you open new tabs:

echo "export DNS_AWS_ACCESS_KEY=$DNS_AWS_ACCESS_KEY" >>~/.bash_profile
echo "export DNS_AWS_SECRET_KEY=$DNS_AWS_SECRET_KEY" >>~/.bash_profile
echo "export DNS_MAXMIND_KEY=$DNS_MAXMIND_KEY" >>~/.bash_profile
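
Running the `echo` lines more than once appends duplicate entries to the profile. If that is a concern, a small helper can skip variables that are already present. This is only a sketch; `persist_export` is a hypothetical name introduced here.

```shell
# Append "export NAME=VALUE" to FILE unless FILE already exports NAME.
# Usage: persist_export NAME VALUE FILE
persist_export() {
  grep -q "^export $1=" "$3" 2>/dev/null ||
    printf 'export %s=%s\n' "$1" "$2" >>"$3"
}
```

For example: `persist_export DNS_MAXMIND_KEY "$DNS_MAXMIND_KEY" ~/.bash_profile`. Note that an existing entry is left untouched, even if the value has changed since.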

Cloud9

Once the environment variables are set up, we can use the following script to request temporary credentials for AWS. Boto3 is the AWS SDK for Python.

git clone https://github.com/DNSBelgium/mercator-workshop-centr.git
sudo pip install boto3
$(python mercator-workshop-centr/aws_assume_role.py export) # no output if successful

Finally, we need a bit more disk space. The following script changes the size of the disk attached to your Cloud9 environment.

mercator-workshop-centr/resize_ebs.sh 50

layout: cover dim: false background: /dariusz-sankowski.jpg

Build Mercator

<style> h1 { @apply absolute top-35 left-80 text-black; } </style>

Environment setup

You will need several things to build Mercator:

  • Java 11+
  • Docker
  • Docker Compose (for running Mercator locally)
  • Helm 3

If you want to run Mercator, you also need a MaxMind license. Create an account and generate a license key.


Helm

Helm is a package manager for Kubernetes. It makes Kubernetes resources easier to manage.

Gradle uses Helm to build Mercator's Helm charts.

To install Helm:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
helm version # version.BuildInfo{Version:"v3.8.1", ...}
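
Since the build expects Helm 3, the installed major version can be checked from a script. A minimal sketch, assuming `helm version --short` prints a string of the form `v3.8.1+g5cb9af4`; the helper name `major_version` is hypothetical.

```shell
# Extract the major version number from a string such as "v3.8.1+g5cb9af4".
major_version() {
  version=${1#v}          # drop a leading "v" if present
  echo "${version%%.*}"   # keep everything before the first dot
}
```

It could be used as `[ "$(major_version "$(helm version --short)")" -ge 3 ] || echo "Helm 3 or newer is required"`.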

We will go deeper into Helm once we deploy Mercator to Kubernetes.


Build Mercator

Clone the git repo and use Gradle to build Mercator

git clone https://github.com/DNSBelgium/mercator.git
cd mercator
./gradlew build -x test # 5 min

Gradle compiles the subprojects. It can also create the Docker images and the Helm charts, as we will see in the next slides.


layout: cover dim: false background: /cover_docker3.jpg

Run Mercator locally using docker-compose

<style> h1 { @apply absolute top-10; } </style>

Docker-compose

Docker Compose [1] is a tool for defining and running multi-container Docker applications. A YAML file contains the definition of the application's services. With a single command, you create and start all the services from your configuration.

Installation

sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" \
  -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version # docker-compose version 1.29.2, build 5becea4c

Create docker images

./gradlew dockerBuild # Create local docker images (7m)

Run Mercator

cat > .env <<EOF
MAXMIND_LICENSE_KEY=$DNS_MAXMIND_KEY
EOF
docker-compose up -d
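
The containers may need a little time before they accept requests. One way to wait is to poll a command until it succeeds. The helper below is a generic sketch; `wait_until` is a hypothetical name, and the attempt count and delay are arbitrary.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() (
  attempts=$1 delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    "$@" >/dev/null 2>&1 && exit 0      # success: stop waiting
    attempts=$((attempts - 1))
    [ "$attempts" -gt 0 ] && sleep "$delay"
  done
  exit 1                                # never succeeded
)
```

For instance, `wait_until 30 2 aws sqs --endpoint-url http://localhost:4566 get-queue-url --queue-name mercator-dispatcher-input` returns once the dispatcher input queue can be resolved, or fails after 30 attempts.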

Docker-compose

Send a request to crawl

aws sqs --endpoint-url http://localhost:4566 send-message --queue-url $(aws sqs --endpoint-url \
 http://localhost:4566 get-queue-url --queue-name mercator-dispatcher-input | jq -r .QueueUrl) \
 --message-body '{"domainName": "dnsbelgium.be"}'
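
To queue several domain names in one go, the message body can be generated per domain. This is a sketch only: `crawl_message` and `enqueue_domains` are hypothetical helpers, and the message format is the one used in the command above.

```shell
# Build the JSON message body for one domain name.
crawl_message() {
  printf '{"domainName": "%s"}' "$1"
}

# Read domain names from stdin (one per line, blank lines skipped) and pass
# each message body as the last argument to the given command.
enqueue_domains() {
  while IFS= read -r domain || [ -n "$domain" ]; do
    [ -n "$domain" ] || continue
    "$@" "$(crawl_message "$domain")" </dev/null
  done
}
```

A dry run such as `printf 'hln.be\nvrt.be\n' | enqueue_domains echo` only prints the bodies. To actually send them, store the queue URL in a variable (here called `QUEUE_URL`, an assumed name) and run `enqueue_domains aws sqs --endpoint-url http://localhost:4566 send-message --queue-url "$QUEUE_URL" --message-body`.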

See the logs

docker-compose logs dns-crawler

Connect to the DB and explore the result

PGPASSWORD=password psql -h localhost -U postgres postgres -f usefulequeries/count_names_crawled.sql

Destroy the local environment

docker-compose down

layout: two-cols

Some .be domains

  • hln.be
  • vrt.be
  • youtu.be
  • google.be
  • telenet.be
  • rtbf.be
  • sudinfo.be
  • belgium.be
  • 2dehands.be
  • proximus.be
  • zalanda.be

::right::

  • ns5.be
  • openprovider.be
  • groupon.be
  • yt.be
  • adidas.be
  • stepstone.be
  • irisnet.be
  • aviation24.be
  • bnpparibasfortis.be
  • bpost2.be
  • nbb.be
  • hubo.be
  • acerta.be
  • denk-it.be

layout: cover background: https://sli.dev/demo-cover.png

Deploy Mercator in the cloud


AWS

Amazon Web Services offers reliable, scalable, and inexpensive cloud computing services, including:

SQS

Simple Queue Service is a distributed message queuing service. It enables you to decouple and scale microservices and distributed systems.

EKS

Elastic Kubernetes Service is the managed Kubernetes service of AWS.

S3

Simple Storage Service is an object storage service.


Terraform

Terraform is an open-source infrastructure-as-code software tool that provides a consistent CLI workflow to manage hundreds of cloud services. Terraform codifies cloud APIs into declarative configuration files.


Deploy Mercator in the cloud

Create the basic infrastructure needed for Mercator (25m).

This creates the network space and the Kubernetes cluster.

cd ~/environment/
git clone https://github.com/DNSBelgium/mercator-infra-tf-aws.git
cd mercator-infra-tf-aws
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
terraform init -backend-config="bucket=mercator-workshop-setup-terraform-${AWS_ACCOUNT_ID}"
terraform apply --auto-approve
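
If the `aws sts get-caller-identity` call fails, `AWS_ACCOUNT_ID` stays empty and the backend bucket name is silently wrong. A small guard can catch this before `terraform init` runs. This is a sketch; `backend_bucket` is a hypothetical helper, and it relies on AWS account IDs being 12 digits long.

```shell
# Print the Terraform state bucket name for a 12-digit AWS account ID.
# Fail with a message on anything else, for example an empty string.
backend_bucket() {
  case $1 in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
      echo "mercator-workshop-setup-terraform-$1" ;;
    *)
      echo "unexpected AWS account ID: '$1'" >&2
      return 1 ;;
  esac
}
```

It could be used as `bucket=$(backend_bucket "$AWS_ACCOUNT_ID") && terraform init -backend-config="bucket=$bucket"`, so that `terraform init` only runs when the account ID looks valid.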

Terraform creates the following resources:

  • VPC and subnets
  • VPC endpoints for accessing AWS services (SQS, ECR, ...)
  • SQS queues
  • S3 buckets
  • ECR repositories
  • An IAM role for Mercator
  • A PostgreSQL database
  • A Kubernetes cluster

Push the Docker images

Kubernetes can only pull Docker images from a registry it can reach. At DNS Belgium, we use AWS ECR to host Docker images. We don't yet publish Mercator's Docker images publicly.

Terraform created the required ECR repositories. We can push the previously built Docker images with Gradle.

cd ~/environment/mercator
aws ecr get-login-password | docker login -u AWS --password-stdin \
  ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
./gradlew dockerBuildAndPush -PdockerRegistry=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/ \
  -PdockerTags=workshop
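
Both commands depend on `AWS_ACCOUNT_ID` and `AWS_REGION` being set in the current shell. These slides set `AWS_ACCOUNT_ID` during the Terraform step; the assumption here is that `AWS_REGION` is provided by the environment. The sketch below builds the registry host and refuses to continue when either value is missing. `ecr_registry` is a hypothetical helper name.

```shell
# Print the ECR registry host for an account ID and region.
# Fail with a message if either argument is empty.
ecr_registry() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "account ID and region must both be set" >&2
    return 1
  fi
  echo "$1.dkr.ecr.$2.amazonaws.com"
}
```

For example: `registry=$(ecr_registry "$AWS_ACCOUNT_ID" "$AWS_REGION") && aws ecr get-login-password | docker login -u AWS --password-stdin "$registry"`.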

Connect to the EKS cluster

Tooling

In order to connect to Kubernetes, we need some extra tools and setup:

curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.23.4/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
kubectl version --client # Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", ...}
curl -LO https://github.com/derailed/k9s/releases/download/v0.25.18/k9s_Linux_x86_64.tar.gz
tar xvf k9s_Linux_x86_64.tar.gz
chmod +x k9s
sudo mv ./k9s /usr/local/bin/k9s
k9s version
aws eks update-kubeconfig --name mercator
kubectl get nodes

Install Helm charts

In Mercator, each component has its own Helm chart. To install them all at once, we created an umbrella chart that depends on every component's chart.

We first need to pass the parameters from Terraform to Helm. The following script generates a values.yaml file with all the parameters for your account.

./generate_helm_values.sh

We can then use helm to install Mercator on Kubernetes

cd ~/environment/mercator
helm dependency build mercator-helm-umbrella
helm install -f ~/environment/mercator-infra-tf-aws/values.yaml mercator mercator-helm-umbrella

You can then see the pods with kubectl (or k9s).

kubectl get deployments
# or
k9s # (Ctrl-C to exit)

Deploy Mercator to the cloud

Send a message to the crawler

aws sqs send-message --queue-url $(aws sqs get-queue-url --queue-name mercator-dispatcher-input \
 | jq -r .QueueUrl) --message-body '{"domainName": "dnsbelgium.be"}'

Access Mercator-ui

kubectl port-forward svc/mercator-mercator-ui 8080:80


layout: center class: text-center hideInToc: true

Learn More

Footnotes

  1. Docker Compose