Kubernetes development setup

alkacell
Posts: 1
Joined: Fri Jul 29, 2022 7:49 am

Kubernetes development setup

Post by alkacell »

First off, thanks for an incredible project.

I saw a post where it was mentioned that the team behind Mayan uses Kubernetes clusters running on aftermarket Dell servers for development and testing.

Can you post more information about this setup? It would be awesome to see how you guys configure Kubernetes to your needs.

Thanks!
franco
Developer
Posts: 42
Joined: Sun Apr 05, 2020 2:30 am

Re: Kubernetes development setup

Post by franco »

I'm responsible for the infrastructure supporting the commercial part of the project. I'll share the parts I know are OK to make public.

Kubernetes:
At first we had a cool but complex set of Ansible playbooks. We gradually switched to Kubespray with Longhorn. Nodes run on KVM machines with dedicated block storage devices for runtime data; persistent data lives either on attached NFS volumes or on object storage via MinIO. VMs are still provisioned with custom code. I have my fingers crossed for official libvirt support in Terraform.
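To give an idea of the shape of that custom provisioning code, here is a minimal sketch using the libvirt-python bindings. The domain XML template, names, and sizing are made up for illustration; this is not our actual tooling:

# Minimal sketch of KVM node provisioning via libvirt-python.
# The domain XML template and sizing below are illustrative only.
import libvirt

DOMAIN_XML = """
<domain type='kvm'>
  <name>{name}</name>
  <memory unit='GiB'>{memory_gib}</memory>
  <vcpu>{vcpus}</vcpu>
  <os><type arch='x86_64'>hvm</type></os>
  <devices>
    <disk type='block' device='disk'>
      <source dev='{block_dev}'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""

def provision_node(name, vcpus, memory_gib, block_dev):
    # Define and boot one Kubernetes node VM on the local hypervisor,
    # with a dedicated block device for runtime data as described above.
    conn = libvirt.open("qemu:///system")
    try:
        dom = conn.defineXML(DOMAIN_XML.format(
            name=name, vcpus=vcpus, memory_gib=memory_gib,
            block_dev=block_dev))
        dom.create()
    finally:
        conn.close()

provision_node("k8s-worker-01", vcpus=8, memory_gib=32, block_dev="/dev/sdb")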

Hardware:
Initially a mismatched gang of custom PCs; over the course of about 18 months all systems were switched to Dell PowerEdge R620s and R720s. The 12th generation ('20) was selected because of its price to performance ratio. Still very good and more than enough for what we need. The power usage to performance ratio is a bit on the high side, but even with hardware from 2012 it is still cheaper and more performant to run than the cloud, even a VPS on a shared host.

The solution for the power consumption was to use an unofficial (not on Dell's supported list) CPU for the '20 servers, the E5-2648L V2 (https://ark.intel.com/content/www/us/en ... 0-ghz.html). This is a variant for telecom equipment meant to run 24/7/365, designed for power efficiency above performance. This CPU is used for the R720s running storage and the R620s running backup jobs.

For CPU intensive loads, the E5-2697 V2 is used (https://ark.intel.com/content/www/us/en ... 0-ghz.html). Two of these CPUs provide 48 threads (2 sockets x 12 cores x 2 threads) per 1U of rack space.

The Dell PowerEdge servers of this generation are very popular, so parts are not scarce and can be found at really good prices even with inflation. Having the same generation of equipment allows buying in bulk. Parts are also 100% interchangeable, removing concern about downtime: even if there are no spares, we can just transplant a part and be up and running again in a few minutes.

In terms of usable performance, aftermarket enterprise equipment still has a lot to offer. Our Kubernetes clusters routinely outperform Kubernetes clusters from the big providers with the same node flavor configuration.

The current record for a single 6 node Kubernetes cluster is 183 million documents with a 500 ms response time under 1,000 concurrent users.
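As an illustration only, not our exact benchmark, a load test of that shape can be expressed in a few lines of Locust. The endpoint paths and task weights here are examples:

# Representative load test with Locust: simulated users listing and
# searching documents against a Mayan REST API. Paths are examples.
# Run with: locust -f loadtest.py --users 1000 --spawn-rate 50
#           --host https://cluster.example.com
from locust import HttpUser, task, between

class DocumentUser(HttpUser):
    wait_time = between(1, 3)  # pause 1-3 s between tasks per user

    @task(3)  # listing is weighted 3x heavier than search
    def list_documents(self):
        self.client.get("/api/v4/documents/?page=1")

    @task(1)
    def search_documents(self):
        self.client.get("/api/v4/search/?q=invoice")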

Storage:
Storage is done entirely with 10K RPM disks. With enough disks and SAS cards flashed to HBA mode the performance is still very good, and the switch to SSDs is not yet cost effective, much less so NVMe drives.

Buying NetApp drives in bulk and reformatting them to 512 byte sectors drives the price down to less than $10 per drive. We have managed to get the cost of drives so low that at the first SMART warning they are removed, and no attempt is made to verify or rectify the error. It would actually cost more in time to do so than to just remove and destroy the drive.
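The reformat is a standard sg_format pass (NetApp gear usually ships with 520 byte sectors). A rough sketch of the intake step, with example device names, folding in the SMART policy:

# Drive intake sketch: screen SMART health with smartctl, then reformat
# healthy drives to 512 byte sectors with sg_format (from sg3_utils).
# Device names are examples; the format destroys all data on the drive.
import subprocess

def smart_healthy(dev):
    # smartctl -H prints "OK" (SCSI) or "PASSED" (ATA) for healthy drives.
    result = subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True)
    return "OK" in result.stdout or "PASSED" in result.stdout

def reformat_512(sg_dev):
    # Low level format to 512 byte sectors; takes hours per drive.
    subprocess.run(["sg_format", "--format", "--size=512", sg_dev],
                   check=True)

for disk, sg in [("/dev/sda", "/dev/sg0"), ("/dev/sdb", "/dev/sg1")]:
    if smart_healthy(disk):
        reformat_512(sg)
    else:
        print(disk, "SMART warning: pull and destroy, do not debug")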

R720s do the bulk of the storage work, with PowerVault MD1220 JBODs to extend storage when needed or when migrating data to bigger disks.

Networking:
I can't share information on our topology. For management and control we use 1 GbE switches, and 10 GbE DAC links in aggregation mode for bulk data. DAC cables are used to keep costs and heat generation on the low side.

We have over 20 servers in our fleet in 7 different locations.

RAM:
As much as possible. The minimum memory configuration is 128 GB. All modules are the same, 32 GB LRDIMMs. The maximum configuration is 768 GB (24 slots x 32 GB).

Operating system:
Compute nodes run vanilla Ubuntu 22.04 LTS. Storage nodes use TrueNAS SCALE.

Backups:
A 4-3-2-1 scheme: live data, 1 live synchronized copy for immediate recovery, 1 copy on the backup servers, 1 cold copy on LTO tape, and 1 cold copy in remote S3 storage. Bacula is used for backups (https://gitlab.com/mayan-edms/bacula).
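The Bacula jobs do the heavy lifting; as a small illustration of the glue around them, checking that the off-site S3 leg actually landed can be a few lines of boto3. Bucket and key names here are made up:

# Post-backup sanity check: confirm the remote S3 cold copy exists and
# has a plausible size. Bucket/key naming is hypothetical.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def cold_copy_ok(bucket, key, min_bytes=1024):
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return False  # object missing or inaccessible
    return head["ContentLength"] >= min_bytes

if not cold_copy_ok("backups-offsite", "bacula/full-2022-08-29.vol"):
    raise SystemExit("remote S3 cold copy missing or truncated")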

Physical servers in the age of the cloud might not make much sense at first, but there are many advantages, and there are use cases where the cloud is either not capable or not a cost effective medium.

Things like load testing, stress testing, labs, and client environment replications are too expensive to run in the cloud, not to mention time consuming to set up.

At the first chance possible, free software projects should start to reinvest whatever budget they have into aftermarket enterprise equipment. It is a game changer.
Attachments
2022-08-29_23-53.png
2022-08-29_23-53_1.png
2022-08-29_23-51.png
2022-08-29_23-51_1.png
2022-08-29_23-49.png
2022-08-29_23-32.png