August 23, 2021


  • SysOps practices and tooling overlap with DevOps.
  • SRE (Site Reliability Engineering) is a subset of SysOps.

SysOps Overview

  • Architecture Oversight: Coordinate with product managers and project managers.
  • Defining and maintaining “static” infrastructure that cuts across applications and business units:
    • Directory services (like LDAP, AD)
    • Message buses (Kafka)
    • Long-lived data storage (DBs)
  • Providing self-service tools for DevOps teams so they can safely cycle entire dev, staging, and prod resources.
  • Performance Efficiency
  • Cost optimization

SRE Overview

  1. Disaster Prevention
  • Fault Tolerance: the ability of a system to remain in operation even if some of its components fail.
  • Resilience: the ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.
    • Auto Backups, onsite, offsite
  • Continuous Monitoring/Testing for:
    • Business-wide security
      • threat detection
    • Reliability: distributed system design, recovery planning, and adapting to changing requirements
    • Performance emergencies, with playbooks
    • Cost-overrun emergencies, with playbooks
  • Security
  2. Disaster Recovery
  • Backup and Restore: slow, cheap
  • Pilot Light
  • Warm Standby
  • Multi-site active/active: fast, expensive
  3. Disaster Testing: causing real disasters on purpose
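
To make fault tolerance and disaster testing concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `Replica` class, `call_with_failover`, and `disaster_test` are assumed names, not part of any real tool. It shows a request surviving the loss of one component, and a test that causes that loss on purpose.

```python
import random


class Replica:
    """A hypothetical service replica that can be switched off to simulate failure."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def call_with_failover(replicas, request):
    """Fault tolerance: try each replica in turn and succeed if any one is up."""
    errors = []
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError as exc:
            errors.append(str(exc))
    raise RuntimeError("all replicas failed: " + "; ".join(errors))


def disaster_test(replicas, request, rng=random):
    """Disaster testing: take one replica down on purpose, then check the system still answers."""
    victim = rng.choice(replicas)
    victim.healthy = False
    try:
        return victim.name, call_with_failover(replicas, request)
    finally:
        victim.healthy = True  # restore the replica after the experiment


if __name__ == "__main__":
    pool = [Replica("a"), Replica("b"), Replica("c")]
    killed, response = disaster_test(pool, "GET /health")
    print(f"killed {killed}; {response}")
```

In a real environment the "victim" would be an actual instance, zone, or dependency, and the check would be a monitored service-level indicator rather than a single return value.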


SRE (Site Reliability Engineering)

IaC, EaC


© 2022, Edward Pike