Installing Mesosphere DC/OS on Small Digital Ocean Droplets

Mesosphere DC/OS is a data center operating system based on Apache Mesos and Marathon. It’s designed to run tasks and containers on a distributed architecture. It can be provisioned on bare metal machines, within virtual machines or on a hosting provider (what some people like to call “the cloud”). I wanted to see what was involved in setting up my own DC/OS instance, both locally and with a provider, for running some of my own projects in containers. I wanted to keep this cluster as low cost as possible, and ran into some issues with the Terraform installation in the DC/OS documentation¹. The following is a brief look at setting up a minimal DC/OS cluster on Digital Ocean.

Provisioning

For one of my projects, I created vSense, a devops provisioning system built around Vagrant and Ansible. It’s used for creating both development and production environments for BigSense, an open source sensor network system. Vagrant boxes can vary between providers, meaning the scripts need to be adjusted to handle differences between VirtualBox images for development and KVM base boxes for production. Thankfully, DC/OS does have an official Vagrant project², and supports deploying to hosted providers using a Terraform script¹.

The following can be used to bring up a local four node cluster (boot, manager, private agent and public agent) using local VirtualBox VMs:

git clone https://github.com/dcos/dcos-vagrant
cd dcos-vagrant
vagrant plugin install vagrant-hostmanager
vagrant up m1 a1 p1 boot

DC/OS provides documentation for installing nodes on several hosted platforms as well. The following is taken from their documentation for using Digital Ocean as a provider:

git clone https://github.com/jmarhee/digitalocean-dcos-terraform
cd digitalocean-dcos-terraform
cp sample.terraform.tfvars terraform.tfvars
# adjust your settings and API token
eval $EDITOR terraform.tfvars
terraform apply
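
Both sets of commands assume that git, Vagrant and Terraform are already installed. The following is a minimal sketch of a preflight check, not something taken from the DC/OS documentation; missing_tools is a hypothetical helper name, and it only confirms that each tool is on the PATH, not that the version is suitable.

```shell
#!/usr/bin/env bash
# Print the name of every required tool that is not on the PATH.
# missing_tools is a hypothetical helper, not part of DC/OS or Terraform.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools used by the two provisioning paths in this article.
missing="$(missing_tools git vagrant terraform)"
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing" >&2
fi
```

If nothing is printed, all three tools were found.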

DC/OS Nodes Running on Digital Ocean Droplets

It’s important to note that DC/OS, despite its name, is not really an operating system. It simply installs Docker and other packages to bootstrap itself on another Linux distribution. When using the Vagrant/VirtualBox installation above, it uses CentOS 7 for its individual virtual machines. Curiously for Digital Ocean, it installs itself onto CoreOS virtual machines.

Authentication

If you start with a fresh install of DC/OS and connect to the master node via HTTP, you’ll get an authentication page allowing the first account to be the administration account. By default, you cannot create this account locally. You are required to use one of the three default identity providers: GitHub, Microsoft or Google. DC/OS community edition has no built-in authentication system. In order to integrate with LDAP, Active Directory or another identity provider, you must purchase the enterprise edition. The community edition allows you to override the default configuration, but only supports OAuth providers and only provides documentation for using the non-free service Auth0³.

I really hesitated here. I rarely ever use external authentication, opting to use strong password algorithms with e-mail based registration instead. I considered figuring out how to override the default, but then caved to my impatience and authenticated via GitHub. This was a bad idea. I started getting unsolicited SPAM from Mesosphere at the e-mail address associated with my GitHub account.

Unsolicited E-Mail from Mesosphere to an E-mail associated with my Github Account

I also started getting SPAM for a secondary account I created within DC/OS.

Unsolicited E-Mail from Mesosphere to an E-mail for an account I created

Furthermore, the e-mail for the new user I manually created didn’t come from a locally running mail server that was part of DC/OS. It was relayed via a completely different third party:

From: DC/OS <help@dcos.io>
Subject: You've been added to a DC/OS cluster
Received: from [54.163.223.191] by mandrillapp.com id dafa457b3e374123b427c283824bfa0f; Sat, 26 Nov 2016 06:58:25 +0000
X-SWU-RECEIPT-ID: log_aad8fa045b93454cca9d5a9ccabc3504-3
Reply-To: <help@dcos.io>
To: <--->

Also, by default, DC/OS has telemetry enabled. If you’re using the Terraform script for installation, it can be disabled by adding telemetry_enabled: 'false' to the make-files.sh script, in the section where it creates the config.yml⁴. I highly recommend you disable telemetry before starting up a cluster, even locally with Vagrant.
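
As a rough illustration of that change, here is a minimal sketch that adds the opt-out line to a configuration file, or rewrites the key if it is already present. It does not reproduce make-files.sh; disable_telemetry is a hypothetical helper, and the temporary file and the cluster_name value stand in for whatever configuration the script actually generates.

```shell
#!/usr/bin/env bash
# Add the telemetry opt-out to a DC/OS config file, or rewrite the key if it
# already exists. disable_telemetry is a hypothetical helper; the caller
# supplies the file path, and nothing here is copied from make-files.sh.
disable_telemetry() {
  local config="$1"
  if grep -q '^telemetry_enabled:' "$config" 2>/dev/null; then
    # Replace the existing value instead of adding a duplicate key.
    sed -i.bak "s/^telemetry_enabled:.*/telemetry_enabled: 'false'/" "$config"
    rm -f "$config.bak"
  else
    printf "telemetry_enabled: 'false'\n" >> "$config"
  fi
}

# Demonstration against a throwaway file (illustrative contents only).
config="$(mktemp)"
printf 'cluster_name: demo\n' > "$config"
disable_telemetry "$config"
cat "$config"
```

Running the helper twice leaves a single telemetry_enabled line, so it is safe to call from a script that may be re-run.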

The SPAM didn’t arrive for a couple of days after I experimented with DC/OS. However, it still bothers me that the official DC/OS provisioning tools enable telemetry by default. It’s not as bad as removing tracking from Alfresco, which is hard coded, but it is unnecessary and is most likely used for marketing purposes.

Minimal Cost

As I’ve mentioned, the minimum number of DC/OS nodes required by default is four. Upon talking to other DC/OS administrators, I’ve found that it’s not necessary to separate out public and private nodes. If you ran with only public nodes, your minimum would drop to three VMs. By default, the Terraform script mentioned above provisions all its nodes as 4gb, which currently run $40 USD/month on Digital Ocean.

If you’re a startup with funding, that isn’t an unreasonable price, even when you start scaling up for redundancy. However, if you’re a small shop trying to get off the ground with limited funding, or if you’re like me where you just want to host your personal projects cheaply, this can seem prohibitively expensive. The smallest size that Digital Ocean offers is a 512mb instance for $5 USD/month, which seems like it’d be more than adequate for the boot node.

Unfortunately, the management node must be a 1gb instance. Anything less leads to an unstable master. As we’ll see below, we can enable swap space on these nodes, but even the master agent is a heavy enough process that it will cause thrashing and lockups on anything less than 1gb of physical memory.

Boot	Management	Public	Price Per Month	Price Per Year
4gb	4gb	4gb	$120	$1440
512mb	1gb	2gb	$35	$420
512mb	1gb	2gb x2	$55	$660
512mb	1gb	1gb	$25	$300
512mb	1gb	1gb x2	$35	$420
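
The totals in the table are simple sums of per-droplet prices. The sketch below shows the arithmetic, assuming one boot node, one management node and a configurable number of public agents. Only the 512mb ($5) and 4gb ($40) prices are quoted above; the 1gb ($10) and 2gb ($20) prices are inferred from the table rows and should be treated as assumptions. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Monthly price in USD for a droplet size. The 512mb and 4gb prices are quoted
# in the article; the 1gb and 2gb prices are inferred from the table (assumption).
droplet_price() {
  case "$1" in
    512mb) echo 5 ;;
    1gb)   echo 10 ;;
    2gb)   echo 20 ;;
    4gb)   echo 40 ;;
    *)     echo "unknown size: $1" >&2; return 1 ;;
  esac
}

# cluster_cost BOOT MASTER AGENT [AGENT_COUNT] prints the monthly total in USD.
cluster_cost() {
  local boot master agent count
  boot="$(droplet_price "$1")" || return 1
  master="$(droplet_price "$2")" || return 1
  agent="$(droplet_price "$3")" || return 1
  count="${4:-1}"
  echo $(( boot + master + agent * count ))
}

# 512mb boot node, 1gb management node, two 2gb public agents.
monthly="$(cluster_cost 512mb 1gb 2gb 2)"
echo "monthly: \$${monthly}, yearly: \$$(( monthly * 12 ))"
```

For the 512mb / 1gb / 2gb x2 layout this prints a monthly total of $55 and a yearly total of $660.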

Keep in mind that by not creating any private nodes, you are trading off the security offered by having non-public facing containers (such as load balancers or web servers) running on nodes only connected to a private network. This is also a minimal non-redundant solution. Redundancy requires either 3 or 5 master nodes and additional agent nodes as well.

Startup Issues

I wanted to use the smallest images possible to save on hosting costs. Unfortunately, both master and agent nodes refuse to start on anything smaller than 2gb images. If you have failures, you can SSH into the individual nodes using your SSH key, the IP address from the Digital Ocean web interface and the user core like so:

ssh -i do-key -lcore <node_ip>

The failures seem to occur during the bootstrapping process in the dcos-download.service:

journalctl -u dcos-download.service
-- Logs begin at Thu 2016-12-08 06:37:51 UTC, end at Thu 2016-12-08 07:31:18 UTC. --
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 systemd[1]: Starting Pkgpanda: Download DC/OS to this host....
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: *   Trying 104.131.142.20...
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: * TCP_NODELAY set
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: * Connected to 104.131.142.20 (104.131.142.20) port 4040 (#0)
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: > GET /bootstrap/e73ba2b1cd17795e4dcb3d6647d11a29b9c35084.bootstrap.tar.xz HTTP/1.
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: > Host: 104.131.142.20:4040
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: > User-Agent: curl/7.50.2
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: > Accept: */*
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: >
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < HTTP/1.1 200 OK
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Server: nginx/1.11.6
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Date: Thu, 08 Dec 2016 06:38:21 GMT
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Content-Type: application/octet-stream
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Content-Length: 581561548
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Last-Modified: Thu, 08 Dec 2016 06:37:03 GMT
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Connection: keep-alive
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < ETag: "5848ff8f-22a9eccc"
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: < Accept-Ranges: bytes
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: <
Dec 08 06:38:21 digitalocean-dcos-public-agent-00 curl[1568]: { [13032 bytes data]
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 curl[1568]: * Failed writing body (456 != 16384)
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 curl[1568]: * Curl_http_done: called premature == 1
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 curl[1568]: * Closing connection 0
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 curl[1568]: curl: (23) Failed writing body (456 != 16384)
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 systemd[1]: dcos-download.service: Control process exited, code=exited status=23
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 systemd[1]: Failed to start Pkgpanda: Download DC/OS to this host..
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 systemd[1]: dcos-download.service: Unit entered failed state.
Dec 08 06:38:27 digitalocean-dcos-public-agent-00 systemd[1]: dcos-download.service: Failed with result 'exit-code'.

If I try to download this file manually from within the node, I can retrieve it successfully. The file is over 500MB, and even the smallest node option (512mb of memory) has 20GB of disk space. Curl’s exit status 23 is a write error, though, which points at local storage rather than the network. So I looked at the mounted filesystems on the individual nodes:

1GB Image:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        483M     0  483M   0% /dev
tmpfs           499M     0  499M   0% /dev/shm
tmpfs           499M  324K  499M   1% /run
tmpfs           499M     0  499M   0% /sys/fs/cgroup
/dev/vda9        27G  579M   26G   3% /
/dev/vda3       985M  588M  347M  63% /usr
tmpfs           499M  499M  4.0K 100% /tmp
/dev/vda1       128M   39M   90M  30% /boot
tmpfs           499M     0  499M   0% /media
/dev/vda6       108M   64K   99M   1% /usr/share/oem
tmpfs           100M     0  100M   0% /run/user/500

2GB Image:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        987M     0  987M   0% /dev
tmpfs          1003M     0 1003M   0% /dev/shm
tmpfs          1003M  428K 1003M   1% /run
tmpfs          1003M     0 1003M   0% /sys/fs/cgroup
/dev/vda9        37G  1.9G   34G   6% /
/dev/vda3       985M  588M  347M  63% /usr
tmpfs          1003M  320K 1003M   1% /tmp
tmpfs          1003M     0 1003M   0% /media
/dev/vda1       128M   39M   90M  30% /boot
/dev/vda6       108M   64K   99M   1% /usr/share/oem
tmpfs           201M     0  201M   0% /run/user/500

The installation services are using the /tmp partition, and on the 1GB image it’s obviously too small to complete downloading the bootstrap image: /tmp is a 499M tmpfs and it is 100% full. By default, a tmpfs mount is sized at half of the machine’s physical memory. The easy solution is to modify the section of make-files.sh that creates the do-install.sh script to ensure we have enough room on /tmp prior to installation. The Digital Ocean instances also don’t come with any swap, so we should create some to ensure we don’t run into errors due to running out of memory⁵.

...
cat > do-install.sh << FIN
#!/usr/bin/env bash
mkdir /tmp/dcos && cd /tmp/dcos

# resize the tmpfs to ensure there's space for the dcos install
sudo mount -t tmpfs -o remount,size=1G /tmp

# setup swap
if [ ! -f /swapfile ]; then
  sudo fallocate -l 2G /swapfile
  sudo chmod 600 /swapfile
  sudo mkswap /swapfile
  sudo swapon /swapfile
  echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
fi

printf "Waiting for installer to appear at Bootstrap URL"
...

We’re not making the /tmp changes permanent by modifying the fstab, so rebooting the instances will set the /tmp allocation back to normal as well as clearing out the installation files.
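
To check whether a node’s /tmp can hold the bootstrap archive before starting the installation, the free space can be compared against the Content-Length reported in the curl log. This is a minimal sketch under that assumption; fits and avail_kb are hypothetical helpers, and the byte count applies only to the specific bootstrap file from my log, so it will differ for other DC/OS releases.

```shell
#!/usr/bin/env bash
# Size of the bootstrap archive, taken from the Content-Length header in the
# curl log. Other DC/OS releases will have a different size.
BOOTSTRAP_BYTES=581561548

# fits AVAIL_KB NEEDED_BYTES succeeds when AVAIL_KB kilobytes can hold
# NEEDED_BYTES bytes.
fits() {
  local needed_kb=$(( ($2 + 1023) / 1024 ))   # round up to whole kilobytes
  [ "$1" -ge "$needed_kb" ]
}

# avail_kb DIR prints the free space on the filesystem holding DIR, in
# 1024-byte blocks (fourth column of POSIX df output).
avail_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

if fits "$(avail_kb /tmp)" "$BOOTSTRAP_BYTES"; then
  echo "/tmp can hold the bootstrap archive"
else
  echo "/tmp is too small for the bootstrap archive" >&2
fi
```

Plugging in the sizes from the df output, a 499M /tmp (510976 KB) fails the check and a 1003M /tmp (1027072 KB) passes. The check covers only the download itself, not any extra space the installer may need afterwards.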

IPv6

It is 2016, and a sustainable Internet does mean we need to start using IPv6. By default, the DC/OS Terraform scripts do not enable IPv6. Adding the following setting to dcos.tf allows the public nodes to have IPv6 addresses. You may want to add this to the other node types if you wish to have them accessible via IPv6 as well.

...
resource "digitalocean_droplet" "dcos_public_agent" {
  name = "${format("${var.dcos_cluster_name}-public-agent-%02d", count.index)}"
  ipv6 = "true"
  depends_on = ["digitalocean_droplet.dcos_bootstrap"]
...

Thoughts on DC/OS

This tutorial simply covered installation of DC/OS. We have only touched the surface, and haven’t discussed running application containers, using marathon-lb for load balancing, volume management, or security and firewall settings for individual nodes. None of these tasks are trivial and deserve tutorials of their own.

Also, we only looked at Digital Ocean, but DC/OS does have official documentation for deployments on AWS, Azure, GCE and Packet. I’d recommend comparing each to reduce service cost.

I’ve seen DC/OS deployed in the wild in full production environments. Its ability to schedule and manage tasks is very powerful. It also comes at the cost of a dedicated support and development team. If you’re a startup with strong development and operations engineers, setting up some kind of task or container orchestration, whether it’s DC/OS or something else, can help ease the pain of scaling out later. For smaller side projects, DC/OS seems prohibitive in both time and service costs.

¹ DigitalOcean DC/OS Installation Guide. DC/OS. Retrieved 10 December 2016.
² Install DC/OS with Vagrant. DC/OS. Retrieved 10 December 2016.
³ Configuring Your Security. DC/OS Documentation. Retrieved 6 December 2016. Archive Version.
⁴ Opt-Out. DC/OS Documentation. Retrieved 6 December 2016. Archive Version.
⁵ How To Add Swap Space on Ubuntu 16.04. Digital Ocean. 26 April 2016.