SysOps vs SRE
- SysOps practices and tooling overlap with DevOps.
- SRE (Site Reliability Engineering) is a subset of SysOps.
SysOps Overview
- Architecture Oversight: Coordinate with product managers and project managers.
- Defining and maintaining “static” infrastructure that cuts across applications and business units.
- Directory services (like LDAP, AD)
- Message buses (Kafka)
- Long-lived data storage (DBs)
- Providing self-service tools so DevOps teams can safely cycle (create, tear down, recreate) entire dev, staging, and prod resources.
- Performance Efficiency
- Cost optimization
SRE Overview
- Disaster Prevention
- Fault Tolerance: the ability of a system to remain in operation even if some of its components fail.
- Resilience: the ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.
- Automated backups, on-site and off-site
- Continuous Monitoring/Testing for:
- Business-wide security
- Threat detection
- Reliability: distributed system design, recovery planning, and adapting to changing requirements
- Performance emergencies, playbooks
- Cost-overrun emergencies, playbooks
- Security
- Zero Trust
- Data encryption in transit and at rest
- Disaster Recovery
- Backup and Restore: slow, cheap
- Pilot Light
- Warm Standby
- Multi-site active/active: fast, expensive
- Disaster Testing: Causing real disasters on purpose
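The resilience goal above (mitigating transient network issues) is often approached with retries and exponential backoff. The following is a minimal Python sketch, not a definitive implementation: `TransientError`, `retry_with_backoff`, and `make_flaky` are hypothetical names introduced here for illustration, and the delay values are arbitrary assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on TransientError with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            # The delay ceiling doubles per attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


def make_flaky(failures: int) -> Callable[[], str]:
    """Build a fake operation that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def call() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientError("temporary network issue")
        return "ok"

    return call


# Usage: two simulated failures, then success. Sleeping is disabled for the demo.
result = retry_with_backoff(make_flaky(2), sleep=lambda _seconds: None)
print(result)
```

Retries only help with faults that are actually transient. A persistent failure still propagates once `max_attempts` is exhausted, which is the point where a playbook or failover would take over.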
SysOps
- Monitoring
- Logging
- Splunk: log mining.
- Elastic Logstash: log ingestion and transformation pipeline.
- Elasticsearch: log mining. Distributed, multitenant-capable full-text search engine with an HTTP web interface.
- OpenSearch: Elasticsearch fork.
- Loggly: log mining.
- Nagios: monitoring, log storage and mining.
- Visualization
- Messaging
SRE (Site Reliability Engineering)
- Continuous Automated Testing
- Continuous Monitoring and Threat Detection
- AWS Security Hub
- Amazon GuardDuty
- Amazon Macie: detects PII exposure in S3.
- Chaos Engineering: actually break things, without warning
- Load Testing
- Apache JMeter: written in Java.
- Locust: load testing, written in Python.
- Gatling
- The Grinder: Java.
- Network Security
IaC, EaC
- Git Workflow for Ops Infrastructure
- GitOps = IaC + MRs + CI/CD
- Declarative
- IaC docs
- Config docs
- Argo Workflows
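The “declarative” property of GitOps above can be illustrated with a reconciliation step: the repository holds the desired state, and automation computes and applies the difference against what is actually running. This is a minimal Python sketch under stated assumptions: the dictionaries stand in for IaC files in Git and for live infrastructure, and `plan_changes` and `reconcile` are hypothetical helpers, not part of any GitOps tool.

```python
from typing import Dict, List, Tuple

State = Dict[str, Dict[str, object]]  # resource name -> resource spec
Plan = List[Tuple[str, str]]          # (action, resource name)


def plan_changes(desired: State, actual: State) -> Plan:
    """Compute the actions needed to move `actual` to `desired`."""
    plan: Plan = []
    for name in sorted(desired):
        if name not in actual:
            plan.append(("create", name))
        elif desired[name] != actual[name]:
            plan.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            plan.append(("delete", name))
    return plan


def reconcile(desired: State, actual: State) -> State:
    """Return a new state with the plan applied; the inputs are not mutated."""
    result = {name: dict(spec) for name, spec in actual.items()}
    for action, name in plan_changes(desired, actual):
        if action == "delete":
            del result[name]
        else:  # "create" and "update" both adopt the desired spec
            result[name] = dict(desired[name])
    return result


# Assumed example data: what Git declares versus what is currently running.
desired_state: State = {
    "web": {"replicas": 3, "image": "web:1.4"},
    "queue": {"replicas": 1, "image": "queue:2.0"},
}
actual_state: State = {
    "web": {"replicas": 2, "image": "web:1.4"},
    "legacy-cron": {"replicas": 1, "image": "cron:0.9"},
}

print(plan_changes(desired_state, actual_state))
# → [('create', 'queue'), ('update', 'web'), ('delete', 'legacy-cron')]
```

In this model a merge request changes only `desired_state`; the plan is what a reviewer would inspect in CI before the pipeline applies it. Running the reconciliation a second time yields an empty plan, which is the idempotence that declarative tooling relies on.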
- Self-Serve Automated Infrastructure for Devs
- Local Assets for Developers
- Remote Assets for Developers
- Integration Testing Assets for CI Build/Test Tools
- Staging Assets for CD
- Supports QA and acceptance-testing workflows
- AWS Control Tower
- AWS Organizations
- Infrastructure Provisioning
- Configuration Mgt.
- Ansible (Red Hat)
- Chef (legacy)
- Puppet (legacy)
- AWS Systems Manager
Infrastructure
- DNS
- Load Balancers
- CDN
- API Mesh
- File Systems, Storage
- Kubernetes Storage
- Amazon
- Amazon Elastic File System (EFS)
- Amazon S3: simple object storage service
- Amazon Elastic Block Store (EBS)
- AWS Storage Gateway: hybrid storage linking on-premises systems to the cloud.
- Amazon Athena: query S3 data with SQL.
- Event Buses
- Apache Kafka: distributed event streaming.
- Amazon EventBridge: serverless event bus for routing events between apps.
- Messaging Queues
- RabbitMQ: message broker.
- Amazon SQS (Simple Queue Service)
- Apache ActiveMQ: Java-based message broker with JS and Python clients.
- ActiveMQ Artemis: next-generation ActiveMQ broker.
- Data Analytics
- Apache Spark: stream and batch processing. 3rd gen.
- Apache Flink: event-driven apps, stream and batch analytics, pipelines, ETL. Newer than Spark. 4th gen. Auto-optimizing. Many options for state maintenance. Supports replay. Known for “Big Data” and “Stream Processing”.
- Amazon Kinesis: streams data into storage.
- Data Warehouse, Lake
- Data Warehouse: structured, filtered data that has already been processed for a specific purpose.
- Data Lake: raw data kept in its original form, not yet processed for a specific purpose.
- Amazon Redshift
- Google BigQuery: serverless cloud data warehouse.
- Snowflake
- Data Routing and ETL
- Apache NiFi: name comes from “NiagaraFiles”, the software's original name at the NSA. Known for “ETL” and “Data Integration”. DAGs for data routing and ETL. Low code. Web GUI. “Platform”.
- Apache Camel: Java enterprise integration “Framework”.
- Example (Camel plus Kafka plus NiFi): a Java app uses Camel to send messages to Kafka; NiFi consumes from Kafka.
- AWS Glue: serverless data integration.
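The Camel plus Kafka plus NiFi example above follows a producer, log, consumer shape. The sketch below imitates that shape in plain Python so the data flow is visible without any of the three products installed. Everything in it is a stand-in and an assumption: `InMemoryTopic` only mimics a topic with per-group offsets, `producer_route` plays the role of the Camel route, `consumer_flow` plays the role of the NiFi flow, and the order fields and the routing threshold are invented for illustration.

```python
import json
from typing import Dict, List


class InMemoryTopic:
    """Stand-in for a Kafka topic: an append-only log with per-group offsets."""

    def __init__(self) -> None:
        self._log: List[bytes] = []
        self._offsets: Dict[str, int] = {}

    def publish(self, payload: bytes) -> None:
        self._log.append(payload)

    def poll(self, group: str) -> List[bytes]:
        """Return records this consumer group has not seen yet."""
        start = self._offsets.get(group, 0)
        records = self._log[start:]
        self._offsets[group] = len(self._log)
        return records


def producer_route(topic: InMemoryTopic, orders: List[dict]) -> None:
    """Producer side (the Camel role): serialize each message and send it."""
    for order in orders:
        topic.publish(json.dumps(order).encode("utf-8"))


def consumer_flow(topic: InMemoryTopic, group: str = "etl") -> Dict[str, List[dict]]:
    """Consumer side (the NiFi role): consume, transform, then route."""
    routed: Dict[str, List[dict]] = {"priority": [], "standard": []}
    for raw in topic.poll(group):
        record = json.loads(raw)
        # Transform step: derive a field from the raw record.
        record["total_cents"] = record["quantity"] * record["unit_cents"]
        # Routing step: pick a destination based on record content.
        destination = "priority" if record["total_cents"] >= 10_000 else "standard"
        routed[destination].append(record)
    return routed


topic = InMemoryTopic()
producer_route(topic, [
    {"sku": "A", "quantity": 2, "unit_cents": 6000},
    {"sku": "B", "quantity": 1, "unit_cents": 500},
])
routed = consumer_flow(topic)
print([r["sku"] for r in routed["priority"]], [r["sku"] for r in routed["standard"]])
```

The producer and consumer never call each other; they share only the topic. That decoupling is why a framework-style producer and a platform-style consumer can be developed and scaled independently.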
- Identity Mgt.
- Workflow Mgt., Event Scheduling
- Apache Airflow
- Orchestration Framework
- DAG: directed acyclic graph. Tasks are vertices; dependencies are edges.
- ETL
- ML training
- Spring Batch: processes large volumes of records, with logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management.
- Luigi: tasks, data pipelines, batch jobs. Written by Spotify in Python.
- Amazon CloudWatch Events (now Amazon EventBridge)
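The DAG idea above is what lets an orchestrator decide which task may run when: a task is eligible only after everything it depends on has finished, and a cycle makes the workflow unschedulable. Here is a minimal sketch using Python's standard-library `graphlib` (Python 3.9+). It is not how Airflow or Luigi are implemented; the task names and the `execution_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return one valid run order; each task maps to the tasks it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


# Assumed example pipeline covering the ETL and ML-training uses listed above.
pipeline: Dict[str, Set[str]] = {
    "extract": set(),
    "transform": {"extract"},
    "train_model": {"transform"},
    "load_warehouse": {"transform"},
    "report": {"train_model", "load_warehouse"},
}

order = execution_order(pipeline)
print(order)  # "extract" comes first and "report" last

# A cycle cannot be scheduled, so the sorter rejects it.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError as error:
    print("not a DAG:", error.args[0])
```

`train_model` and `load_warehouse` do not depend on each other, so a scheduler is free to run them in parallel; the sorted list is only one of several valid orders.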