SysOps

August 23, 2021

SysOps vs SRE vs DevOps

SysOps practices and tooling overlap with DevOps.
SRE (Site Reliability Engineering) is a subset of SysOps.
SysOps vs DevOps

SysOps Overview

Systems Operations Engineers manage physical and cloud infrastructure. SysOps teams follow the ITIL (Information Technology Infrastructure Library) approach. They handle patch management, IaC, hypervisors, and VMs.

Architecture Oversight: Coordinate with product managers and project managers.
Defining and maintaining “static” infrastructure that cuts across applications and business units.
- Directory services (like LDAP, AD)
- Message buses (Kafka, RabbitMQ)
- Long-lived data storage (DBs)
- Long-lived VMs that may host containers managed by Kubernetes.
Providing self-service tools for DevOps so they can cycle entire dev, staging, and prod resources safely. See also Platform Engineering.
Performance Efficiency
Cost optimization

SRE Overview

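Site Reliability Engineers write automation code to increase the stability and performance of systems. They focus on SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements). The SRE role originated at Google around 2003 and became widely known after Google published its Site Reliability Engineering book in 2016. It fills the gap between SysOps and DevOps.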
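
As a rough illustration of how an SLO turns into an error budget, the sketch below computes the allowed downtime for an availability target and the share of budget left for an event-based SLI. The function names and the 30-day window are assumptions made for this example, not part of any specific SRE tool.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO such as 0.999."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # nothing served yet, so nothing spent
    allowed_bad = (1.0 - slo) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime. Teams often slow releases once the remaining budget approaches zero.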

Disaster Prevention

Fault Tolerance: ability for a system to remain in operation even if some of the components used to build the system fail.
Resilience: ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.
- Automated backups, onsite and offsite
Continuous Monitoring/Testing for:
- Business-wide security
  - threat detection
- Reliability: distributed system design, recovery planning, and adapting to changing requirements
- Performance emergencies, playbooks
- Cost overrun emergencies, playbooks
Security
- Zero Trust
- Data encryption in transit, at rest
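
The resilience definition above mentions transient network issues. One common mitigation is retrying with exponential backoff and jitter. The following is a minimal sketch under stated assumptions: the `retry` helper, its parameters, and the choice to retry only on `ConnectionError` are illustrative, not taken from a particular library.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call operation(), retrying on ConnectionError with capped exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt, is capped, then scaled by random jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())
    raise AssertionError("unreachable")
```

The `sleep` and `rng` parameters are injectable so the behaviour can be checked without real waiting. Jitter spreads retries out so that many clients do not hammer a recovering service at the same moment.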

Disaster Recovery

Backup and Restore: slow, cheap
Pilot Light
Warm Standby
Multi-site active/active: fast, expensive

Disaster Testing: Causing real disasters on purpose

Principles of Chaos Engineering
Chaos engineering was pioneered by Netflix, starting with its Chaos Monkey tool.
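
The basic shape of a chaos experiment (check steady state, inject a failure, check steady state again) can be sketched as below. This is a simulation only: `run_experiment`, `terminate`, and `steady_state` are hypothetical names, and no real cloud API is called.

```python
import random
from typing import Callable, Optional, Sequence


def run_experiment(
    instances: Sequence[str],
    terminate: Callable[[str], None],
    steady_state: Callable[[], bool],
    rng: Optional[random.Random] = None,
) -> dict:
    """Terminate one randomly chosen instance and report whether the system stayed healthy."""
    if not instances:
        raise ValueError("no instances to experiment on")
    rng = rng or random.Random()
    # Never inject failure into a system that is already unhealthy.
    if not steady_state():
        return {"victim": None, "aborted": True, "steady_after": False}
    victim = rng.choice(list(instances))
    terminate(victim)
    return {"victim": victim, "aborted": False, "steady_after": steady_state()}


if __name__ == "__main__":
    fleet = ["web-1", "web-2", "web-3"]  # hypothetical instance names
    killed = []
    # Hypothetical health check: healthy while at least two instances survive.
    healthy = lambda: len(fleet) - len(killed) >= 2
    print(run_experiment(fleet, killed.append, healthy))
```

In a real setup, `terminate` would wrap a provider call and `steady_state` would query monitoring. Both are left as plain callables here so the sketch stays self-contained.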

SysOps

Monitoring
- Prometheus
- AWS CloudWatch
- New Relic
- Dynatrace
- AppDynamics
- Datadog
- Wireshark (packet capture and analysis rather than monitoring proper)
Logging
- Splunk: log mining.
- Elastic Logstash: log ingestion, transform
- Elasticsearch: log mining. Distributed, multitenant-capable full-text search engine with an HTTP web interface
  - OpenSearch: Elasticsearch fork.
- Loggly: log mining.
- Nagios: monitoring, log storage and mining.
Visualization
- Grafana
- Elastic Kibana
Messaging
- PagerDuty
- Slack
- AWS SNS

SRE (Site Reliability Engineering)

Continuous Automated Testing
- AWS Cloud Security
  - Continuous monitoring and Threat Detection
  - AWS Security Hub
  - Amazon GuardDuty
  - Amazon Macie: detects PII exposure in S3.
- Chaos Engineering: actually break things, without warning
- Load Testing
  - Apache JMeter: written in Java.
    - BlazeMeter plugin
  - Locust: load testing tool written in Python.
  - Gatling
  - The Grinder: written in Java.
- Pen Testing
  - Tools
Network Security
- Kerberos

IaC, EaC

Infrastructure as Code (IaC)
Everything as Code (EaC)
Git Workflow for Ops Infrastructure
- GitOps = IaC + MRs + CI/CD
  - Declarative
  - IaC docs
  - Config docs
- Argo Workflows
Self Serve Automated Infrastructure For Devs
- Local Assets for Developers
- Remote Assets for Developers
- Integration Testing Assets for CI Build/Test Tools
- Staging Assets for CD
  - supports QA, Acceptance Testing workflow
- AWS Control Tower
- AWS Organizations
Infrastructure Provisioning
- Terraform: HashiCorp.
- Crossplane: Terraform vs Crossplane
- Pulumi
- AWS CloudFormation
- AWS Elastic Beanstalk
- AWS CDK (Cloud Development Kit): define infrastructure in general-purpose languages, similar in spirit to Pulumi. Synthesizes CloudFormation templates.
Configuration Mgt.
- Ansible: Red Hat. Playbooks.
- Chef: legacy.
- Puppet: legacy.
- AWS Systems Manager
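
The declarative idea behind GitOps and the provisioning tools listed above is that the repository states the desired resources and tooling computes the changes needed to reach them. A minimal sketch of that diff step follows. The `plan` function and the resource dictionaries are hypothetical and far simpler than what real tools do.

```python
from typing import Dict, List

Resources = Dict[str, dict]  # resource name -> its configuration


def plan(desired: Resources, actual: Resources) -> Dict[str, List[str]]:
    """Compare desired state (from Git) with actual state and list the required changes."""
    create = sorted(name for name in desired if name not in actual)
    delete = sorted(name for name in actual if name not in desired)
    update = sorted(
        name for name in desired
        if name in actual and desired[name] != actual[name]
    )
    return {"create": create, "update": update, "delete": delete}


if __name__ == "__main__":
    desired = {"vm-a": {"size": "large"}, "db-1": {"engine": "postgres"}}
    actual = {"vm-a": {"size": "small"}, "vm-old": {"size": "small"}}
    print(plan(desired, actual))
```

Running the same plan repeatedly until it comes back empty is the reconciliation loop: a merged change to the repository alters `desired`, and automation closes the gap.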

Infrastructure

AWS Outposts: on-prem AWS cloud.

DNS

Load Balancers

Software

Cloudflare
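Envoy: open-source proxy originally built at Lyft, now a CNCF project. HashiCorp Consul uses it as a sidecar proxy.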
Azure
- App Gateway
- Traffic Manager
- Load Balancer
Google
- Cloud Traffic Director
- Cloud Load Balancer
AWS
- Gateway Load Balancer
- Elastic Load Balancing
VMware NSX
Fastly Edge
NGINX: Owned by F5
Barracuda
A10 Thunder
Kubernetes Ingress, etc, see Containers page.

Hardware

F5
Citrix ADC appliance

CDN (Content Delivery Network)

APIs

API Gateways
- AWS API Gateway
- GCP Apigee
Service Mesh
- Istio: Google, IBM.
- Linkerd: data-plane proxy written in Rust. Integrates with Traefik, Kong, and Gloo Edge.
- Traefik Mesh
- HashiCorp Consul Connect
- AWS App Mesh
- Apache ServiceComb
- Kuma: created by Kong, now a CNCF project.

Storage: File, Block, Object

File storage, block storage, or object storage?

Kubernetes Storage (see Containers)
Amazon
- Amazon Elastic File System (EFS)
- Amazon S3 (Simple Storage Service): object storage
- Amazon Elastic Block Store (EBS)
- AWS Storage Gateway: hybrid, on-prem.
- Amazon Athena: query S3 data with SQL.

Event Buses

Apache Kafka: distributed event streaming.
Amazon EventBridge: serverless event bus between apps.

Messaging Queues

RabbitMQ: message broker.
AWS SQS (Simple Queue Service)
Apache ActiveMQ: Java-based message broker with JS and Python clients.
- Artemis: next generation

Data Analytics

Apache Spark: stream and batch processing. 3rd gen.
Apache Flink: event-driven apps, stream and batch analytics, pipelines, ETL. Newer than Spark. 4th gen. Auto optimize. Many options for state maintenance. Supports replay. Known for “Big Data” and “Stream Processing”
Amazon Kinesis: streams data into storage.

Data Warehouse, Lake

Data Warehouse: structured, filtered data that has already been processed for a specific purpose
Data Lake: raw data
Amazon Redshift
Google BigQuery: serverless cloud data warehouse.
Snowflake

Data Routing and ETL

Apache NiFi: the name comes from “NiagaraFiles”, the software's original name at the NSA. Known for “ETL” and “Data Integration”. DAGs for data routing and ETL. Low code. Web GUI. “Platform”.
Apache Camel: Java enterprise integration “Framework”.
Example of Camel plus Kafka plus NiFi: a Java app uses Camel to send messages to Kafka, and NiFi consumes from Kafka.
AWS Glue: serverless data integration.

Identity Mgt.

LDAP
- MS Active Directory
- Azure AD: cloud Active Directory.
AWS Cognito
Google Cloud Directory Sync

Workflow Mgt., Event Scheduling

Apache Airflow
- Orchestration Framework
- DAG: directed acyclic graphs. Vertices and edges.
- ETL: extract, transform, load.
Spring Batch: processes large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management.
Luigi: tasks, data pipelines, batch jobs. Written by Spotify. Python.
CloudWatch Events: event scheduling, distinct from general CloudWatch monitoring. Now part of Amazon EventBridge.
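
Workflow tools such as Airflow and Luigi model jobs as a DAG and run each task only after its upstream tasks finish. The sketch below shows that ordering step using Python's standard `graphlib` module (Python 3.9+). The task names and the `execution_order` helper are illustrative assumptions, not how either tool is implemented.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return tasks in an order where every task follows all of its upstream tasks.

    dependencies maps each task to the set of tasks it depends on.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # A cycle means the graph is not a DAG and cannot be scheduled.
        raise ValueError(f"not a DAG, cycle found: {exc.args[1]}") from exc


if __name__ == "__main__":
    # Hypothetical ETL pipeline: extract -> transform -> load
    etl = {"transform": {"extract"}, "load": {"transform"}}
    print(execution_order(etl))
```

The acyclic requirement is what makes scheduling possible: if two tasks depended on each other, no valid run order would exist, which is why the helper rejects cycles.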