August 23, 2021

SysOps vs SRE vs DevOps

  • SysOps practices and tooling overlap with DevOps.
  • SRE (Site Reliability Engineering) is a subset of SysOps.
  • SysOps vs DevOps

SysOps Overview

Systems Operations engineers manage physical and cloud infrastructure. SysOps follows the ITIL (Information Technology Infrastructure Library) approach. Typical work covers patch management, IaC (Infrastructure as Code), hypervisors, and VMs.

  • Architecture Oversight: Coordinate with product managers and project managers.
  • Defining and maintaining “static” infrastructure that cuts across applications and business units.
    • Directory services (like LDAP, AD)
    • Message buses (Kafka, RabbitMQ)
    • Long-lived data storage (DBs)
    • Long-lived VMs that may host containers managed by Kubernetes.
  • Providing self-service tools for DevOps so they can cycle entire dev, staging, and prod resources safely. See also Platform Engineering.
  • Performance Efficiency
  • Cost optimization

SRE Overview

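Site Reliability Engineers write automation code to increase the stability and performance of systems. They focus on SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements). The SRE role originated at Google around 2003 and became widely known after Google's Site Reliability Engineering book was published in 2016. It fills the gap between SysOps and DevOps.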
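
As a rough illustration of how an SLI, an SLO, and the resulting error budget relate, here is a minimal Python sketch. The function names, the 99.9% target, and the request counts are assumptions made up for the example, not figures from any real system.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    allowed_failure = 1.0 - slo_target  # e.g. roughly 0.001 for a 99.9% SLO
    actual_failure = 1.0 - sli
    if allowed_failure == 0:
        # A 100% SLO leaves no budget at all.
        return 0.0 if actual_failure == 0 else float("-inf")
    return 1.0 - actual_failure / allowed_failure


# Hypothetical month: 1,000,000 requests, 500 of them failed.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
budget = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI={sli:.4%}, error budget remaining={budget:.0%}")
```

With a 99.9% SLO, 1,000 failures per million requests are allowed, so 500 failures leave about half the budget. An SLA would typically be a looser, contractual version of the same target.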

  1. Disaster Prevention
  • Fault Tolerance: ability for a system to remain in operation even if some of the components used to build the system fail.
  • Resilience: ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.
    • Automated backups, onsite and offsite
  • Continuous Monitoring/Testing for:
    • Business-wide security
      • threat detection
    • Reliability: distributed system design, recovery planning, and adapting to changing requirements
    • Performance emergencies, playbooks
    • Cost overrun emergencies, playbooks
  • Security
  2. Disaster Recovery
  • Backup and Restore: slow, cheap
  • Pilot Light
  • Warm Standby
  • Multi-site active/active: fast, expensive
  3. Disaster Testing: causing real disasters on purpose (chaos engineering)
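
The resilience and disaster-testing ideas above can be sketched together in a few lines of Python: one helper retries transient failures with exponential backoff, and another injects failures on purpose so the retry path can be exercised. All names here (`TransientError`, `retry_with_backoff`, `make_flaky`) are hypothetical, and this is only one of many ways to mitigate transient network issues.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stands in for a timeout, dropped connection, or similar short-lived fault."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient failures with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the failure to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))


def make_flaky(operation: Callable[[], T], failures: int) -> Callable[[], T]:
    """Disaster-testing helper: make the first `failures` calls fail on purpose."""
    remaining = failures

    def wrapper() -> T:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise TransientError("injected fault")
        return operation()

    return wrapper


delays = []  # record the waits instead of actually sleeping
flaky_fetch = make_flaky(lambda: "ok", failures=2)
result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, len(delays))
```

Passing `sleep` as a parameter is a design choice for testability: the example records the delays rather than waiting for them.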


SRE (Site Reliability Engineering)

IaC, EaC


AWS Outposts: on-premises AWS cloud.


Load Balancers

  • Cloudflare

  • Envoy: CNCF project originally built at Lyft; HashiCorp Consul uses it as its sidecar proxy

  • Azure

    • Application Gateway
    • Traffic Manager
    • Load Balancer
  • Google

    • Traffic Director
    • Cloud Load Balancing
  • AWS

    • Gateway Load Balancer
    • Elastic Load Balancing
  • VMware NSX

  • Fastly Edge

  • NGINX: Owned by F5

  • Barracuda

  • A10 Thunder

  • Kubernetes Ingress, etc.; see the Containers page.
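
For orientation, the core job all of these products share can be sketched as a round-robin picker that skips unhealthy backends. This is a toy illustration in Python, not how any listed product is implemented. The class name and the backend names are made up.

```python
class RoundRobinBalancer:
    """Hand out backends in rotation, skipping any that are marked down."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call, starting where we left off.
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")  # e.g. a failed health check
picks = [lb.pick() for _ in range(4)]
print(picks)  # ['app-1', 'app-3', 'app-1', 'app-3']
```

Real load balancers add active health checks, weighting, connection draining, and TLS termination on top of this basic rotation.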


CDN (Content Delivery Network)


Storage: File, Block, Object

File storage, block storage, or object storage?

Event Buses

Messaging Queues

Data Analytics

Data Warehouse, Lake

Data Routing and ETL

  • Apache NiFi: the name comes from “NiagaraFiles”, the software's original name at the NSA, where it was developed. Known for “ETL” and “Data Integration”. DAGs for data routing and ETL. Low code. Web GUI. “Platform”.
  • Apache Camel: Java enterprise integration “Framework”.
  • Example of Camel plus Kafka plus NiFi: a Java app uses Camel to send messages to Kafka, and NiFi consumes them from Kafka.
  • AWS Glue: serverless data integration.
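
The shape of the Camel, Kafka, and NiFi example above can be mimicked in-process to show who does what. The sketch below is an analogy only: a standard-library `queue.Queue` stands in for the Kafka topic, and plain Python functions stand in for the Camel producer and the NiFi flow. No Camel, Kafka, or NiFi API is used, and the order fields and the routing threshold are invented.

```python
import json
import queue

topic = queue.Queue()  # stands in for a Kafka topic


def produce(orders):
    """Role of the Camel-based app: serialize each record and publish it."""
    for order in orders:
        topic.put(json.dumps(order))


def consume_and_route():
    """Role of NiFi: read each message and route it based on its content."""
    routed = {"priority": [], "standard": []}
    while not topic.empty():
        order = json.loads(topic.get())
        route = "priority" if order["amount"] >= 100 else "standard"
        routed[route].append(order["id"])
    return routed


produce([
    {"id": "A1", "amount": 20},
    {"id": "A2", "amount": 250},
    {"id": "A3", "amount": 99},
])
routed = consume_and_route()
print(routed)
```

The point of the middle layer is decoupling: the producer and the consumer only agree on the topic and the message format, so either side can be replaced without touching the other.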

Identity Mgt.

Workflow Mgt., Event Scheduling

  • Apache Airflow
  • Spring Batch: processes large volumes of records, with logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management.
  • Luigi: tasks, data pipelines, batch jobs. Written in Python by Spotify.
  • CloudWatch Events (now Amazon EventBridge), NOT generic CloudWatch.
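
Tools like Airflow and Luigi run tasks in dependency order. That one idea can be sketched with Python's standard-library `graphlib` (Python 3.9+). This is not the Airflow or Luigi API, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},
    "load": {"clean", "enrich"},
}


def run_pipeline(dependencies, actions):
    """Run every task once, only after all of its dependencies have run."""
    completed = []
    for task in TopologicalSorter(dependencies).static_order():
        actions[task]()
        completed.append(task)
    return completed


# Each action just announces itself; a real task would do actual work.
actions = {name: (lambda name=name: print(f"running {name}")) for name in dependencies}
order = run_pipeline(dependencies, actions)
```

Here `extract` always runs first and `load` last, while the relative order of `clean` and `enrich` is left to the sorter. Real schedulers add retries, time-based triggers, and parallel execution of independent tasks.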