
    Introduction: key concepts

    In this initial section of the course, we briefly review the key concepts related to EC3. You can go directly to the section of your interest. Moreover, if you are already familiar with all these concepts, you can go directly to the next section of the course (Elastic Cloud Computing Cluster), or have fun with the little questionnaire prepared at the end of this section.

    1.- Virtualization & Cloud computing

    These two concepts are deeply related, and they are the key computing paradigms behind EC3. Virtualization is defined by NIST (National Institute of Standards and Technology) as "the simulation of the software and/or hardware upon which other software runs". And what are the benefits of such simulation instead of using the hardware resources directly? As NIST states, "the main advantage of full virtualization is its ability to maximize the use of a system's resources. By loading the system with multiple operating systems and services, no processing or memory power goes to waste".

    There are several types of virtualization, from application virtualization, which provides the ability to run server applications on a user's desktop, to full virtualization, which provides a complete simulation of the underlying hardware. In between we can also find paravirtualization, which provides a partial simulation of the hardware of a physical server, and specific resource virtualization, such as storage or network virtualization. The key component in virtualization is the hypervisor. A hypervisor, also known as a virtual machine monitor or VMM, is the software that creates and runs Virtual Machines (VMs). A hypervisor allows one host computer to support multiple guest VMs by virtually sharing its resources, such as memory and processing. To run a VM, the hypervisor uses a Virtual Machine Image (VMI), a file comprising the operating system to emulate. Let's watch a video that illustrates all these concepts:



    Regarding Cloud Computing, we also take a look at the NIST definition of this paradigm (full document available here), which describes it as "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models."

    • Cloud Computing's five essential characteristics: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.

    • The three service models for Cloud Computing are: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).

    • Cloud Computing deployment models: private cloud, community cloud (also federated cloud), public cloud, and hybrid cloud.

    You can now watch the next video prepared by one of the most popular public cloud providers: Amazon Web Services. Enjoy it!



    2.- Clusters and Local Resource Management Systems

    According to Wikipedia, "a computer cluster is a set of computers that work together so that they can be viewed as a single system". All these computers are interconnected through fast local area networks, allowing them to work together and perform computationally intensive tasks. In a cluster, each computer is referred to as a "node". If all the nodes have the same physical characteristics (i.e., the same number of CPUs or GPUs, the same amount of RAM and disk, etc.) and the same OS, we have a homogeneous cluster. However, diversity is also allowed, and in that case we have a heterogeneous cluster. Notice that a cluster does not have to be composed of physical machines: it can also be deployed on a Cloud Computing platform, forming a virtual cluster composed of virtual machines. This is what the EC3 tool provides to its users.

    Typically, a cluster has a small number of front-end nodes, usually one or two (for fault tolerance purposes), and a large number of compute nodes or working nodes. The front-end node is the computer to which the user logs in, and where they edit scripts, compile code, and submit jobs.

    The jobs are automatically run on the compute nodes by the Local Resource Management System (LRMS), the software in charge of scheduling tasks and managing the nodes that compose the cluster. Several LRMSs are used in both science and business environments. The most widely used and well-known ones are:

    • SLURM is a workload manager designed specifically to satisfy the demanding needs of high performance computing (HPC). It is free and open-source, which facilitates its adoption at government laboratories, universities, and companies worldwide. Slurm is highly configurable: it comes with a set of optional plugins that provide the functionality needed to satisfy the needs of demanding HPC centers.

    • TORQUE is a resource manager that provides control over batch jobs and distributed computing resources. TORQUE can integrate with the non-commercial Maui Cluster Scheduler or the commercial Moab Workload Manager to improve overall utilization, scheduling, and administration on a cluster.

    • Kubernetes (also known as K8s) is an open-source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery. 

    • Apache Mesos is a distributed systems kernel that abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.

    • HTCondor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, HTCondor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to HTCondor, which places them into a queue, chooses when and where to run them based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.

    This is a very brief description of each LRMS; you can follow the links to each tool's official webpage to learn more details about them.
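    To give a flavour of how a user interacts with an LRMS such as SLURM from the front-end node, here is a minimal sketch of a batch script. The job name, resource limits, and output file are illustrative choices, not part of any real cluster configuration:

    ```shell
    #!/bin/bash
    # Hypothetical SLURM batch script: the #SBATCH lines are directives
    # read by the scheduler, not executed by the shell.
    #SBATCH --job-name=hello        # name shown in the queue
    #SBATCH --ntasks=1              # run a single task
    #SBATCH --time=00:05:00         # wall-clock time limit
    #SBATCH --output=hello_%j.out   # %j expands to the job ID

    # The actual work: print which compute node the job landed on.
    echo "Hello from node $(hostname)"
    ```

    The user would save this as, e.g., `hello.sh`, submit it with `sbatch hello.sh`, and check its state in the queue with `squeue`; the scheduler decides when and on which compute node the script actually runs.
    
    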

    Image from Jhon Voo Flickr account.

    3.- Infrastructure-as-Code tools

    "A long time ago, in a data center far, far away, an ancient group of powerful beings known as sysadmins used to deploy infrastructure manually. Every server, every route table entry, every database configuration, and every load balancer was created and managed by hand. It was a dark and fearful age: fear of downtime, fear of accidental misconfiguration, fear of slow and fragile deployments, and fear of what would happen if the sysadmins fell to the dark side (i.e. took a vacation). The good news is that thanks to the DevOps Rebel Alliance, we now have a better way to do things: Infrastructure-as-Code (IaC)." Source: https://blog.gruntwork.io/.

    Infrastructure as code (IaC) is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools. 

    This has a number of benefits:

    • You can automate your entire provisioning and deployment process, which makes it much faster and more reliable than any manual process.

    • You can store those source files in version control, which means the entire history of your infrastructure is now captured in the commit log, which you can use to debug problems, and if necessary, roll back to older versions.

    • You can validate each infrastructure change through code reviews and automated tests.

    • You can create a library of reusable, documented, battle-tested infrastructure code that makes it easier to scale and evolve your infrastructure.
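    As an illustration of such machine-readable definition files, here is a minimal sketch of what an Ansible playbook looks like (Ansible is the IaC tool EC3 relies on). The host group, package, and service names below are hypothetical examples, not EC3's actual recipes:

    ```yaml
    # Hypothetical playbook: install and start the SLURM compute daemon
    # on every node in the "compute_nodes" inventory group.
    - hosts: compute_nodes
      become: true
      tasks:
        - name: Install the SLURM compute daemon package
          ansible.builtin.apt:
            name: slurmd
            state: present

        - name: Ensure the SLURM daemon is running and enabled at boot
          ansible.builtin.service:
            name: slurmd
            state: started
            enabled: true
    ```

    Because the desired state lives in a plain text file like this, it can be versioned, reviewed, and re-applied to as many nodes as needed, which is exactly the set of benefits listed above.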

    There are several tools to manage infrastructure-as-code, but the most well-known ones are Ansible (the one used by EC3), Puppet, Chef, Saltstack, Terraform, and CloudFormation. You can follow the links to learn more details about these tools. We also recommend watching the following video as a summary of some of them:



    Now, you can go to this brief questionnaire to test yourself on these introductory key concepts: