- SysOps practices and tooling overlap with DevOps.
- SRE (Site Reliability Engineering) is a subset of SysOps.
- SysOps vs DevOps
Systems Operations (SysOps) engineers manage physical and cloud infrastructure. SysOps teams follow the ITIL (Information Technology Infrastructure Library) approach. They handle patch management, IaC (Infrastructure as Code), hypervisors, and VMs.
- Architecture Oversight: Coordinate with product managers and project managers.
- Defining and maintaining “static” infrastructure that cuts across applications and business units.
- Directory services (like LDAP, AD)
- Message buses (Kafka, RabbitMQ)
- Long-lived data storage (databases)
- Long-lived VMs that may host containers managed by Kubernetes.
- Providing self-service tools for DevOps so they can cycle entire dev, staging, and prod resources safely. See also Platform Engineering.
- Performance Efficiency
- Cost optimization
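Site Reliability Engineers write automation code to increase the stability and performance of systems. They focus on SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements). The SRE role originated at Google around 2003 and spread widely after Google published its Site Reliability Engineering book in 2016. It fills the gap between SysOps and DevOps.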
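To make the SLI/SLO focus concrete, here is a minimal sketch of an error budget check. The function name `error_budget`, the request counts, and the 99.9% target are illustrative assumptions, not values from any particular monitoring tool.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare a measured SLI (success ratio) against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # SLI: the fraction of requests that succeeded in the measurement window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many failures the SLO tolerates in the same window.
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed_requests,
    }


# Hypothetical window: one million requests, 400 failures, 99.9% SLO.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

A negative `budget_remaining` would mean the SLO was missed for the window, which is typically the signal to slow releases and prioritize reliability work.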
- Disaster Prevention
- Fault Tolerance: ability for a system to remain in operation even if some of the components used to build the system fail.
- Resilience: ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.
- Automated backups: onsite and offsite
- Continuous Monitoring/Testing for:
- Business-wide security
- threat detection
- Reliability: distributed system design, recovery planning, and adapting to changing requirements
- Performance emergencies, playbooks
- Cost overrun emergencies, playbooks
- Business-wide security
- Zero Trust
- Data encryption in transit, at rest
- Disaster Recovery
- Backup and Restore: slow, cheap
- Pilot Light
- Warm Standby
- Multi-site active/active: fast, expensive
- Disaster Testing: Causing real disasters on purpose
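The resilience item above mentions mitigating transient network issues. One common mitigation is retrying with exponential backoff and jitter; the sketch below assumes the failing call raises `ConnectionError`, and the helper name `retry_with_backoff` and its default delays are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on ConnectionError with backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay cap doubles each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: a random delay below the cap avoids synchronized retries.
            sleep(random.uniform(0, cap))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

The `sleep` parameter is injectable so that tests can run without real waiting.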
Continuous Automated Testing
- Continuous monitoring and Threat Detection
- AWS Security Hub
- Amazon GuardDuty
- Amazon Macie: detects PII exposure in S3.
Chaos Engineering: actually break things, without warning
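As a small-scale illustration of the idea, the sketch below injects random failures into a dependency and checks that the caller degrades gracefully. This is a toy, in-process stand-in for real chaos tooling; `fetch_price`, `price_with_fallback`, and the failure rate are hypothetical.

```python
import random
from typing import Callable


def with_fault_injection(func: Callable, failure_rate: float, rng: random.Random) -> Callable:
    """Wrap func so that a fraction of calls raise ConnectionError instead of running."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item: str) -> int:
    """Hypothetical dependency that the experiment will break."""
    return {"widget": 10}.get(item, 0)


def price_with_fallback(fetch: Callable[[str], int], item: str, default: int = -1) -> int:
    """Behavior under test: fall back to a default when the dependency fails."""
    try:
        return fetch(item)
    except ConnectionError:
        return default


# Break roughly 30% of calls and confirm that no exception escapes the caller.
chaotic_fetch = with_fault_injection(fetch_price, failure_rate=0.3, rng=random.Random(42))
results = [price_with_fallback(chaotic_fetch, "widget") for _ in range(100)]
print(set(results))
```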
Git Workflow for Ops Infrastructure
Self Serve Automated Infrastructure For Devs
AWS Outposts: on-premises AWS cloud.
- App Gateway
- Traffic Manager
- Load Balancer
- Cloud Traffic Director
- Cloud Load Balancer
- Gateway Load Balancer
- Elastic Load Balancing
NGINX: Owned by F5
Kubernetes Ingress, etc.; see Containers page.
CDN (Content Delivery Network)
- API Gateways
- Service Mesh
Storage: File, Block, Object
- Kubernetes Storage (see Containers)
- RabbitMQ message broker
- AWS SQS (Simple Queue Service)
- Apache ActiveMQ: Java-based message broker with JS and Python clients.
- Artemis: next gen
- Apache Spark: stream and batch processing. 3rd gen.
- Apache Flink: event-driven apps, stream and batch analytics, pipelines, ETL. Newer than Spark. 4th gen. Auto optimize. Many options for state maintenance. Supports replay. Known for “Big Data” and “Stream Processing”
- Amazon Kinesis: streams data into storage.
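The state maintenance and replay features noted for Flink above can be illustrated without any framework. The sketch below keeps a per-key running total and rebuilds it by replaying a retained event log from a checkpoint; it is plain Python, not the Flink or Spark API, and the event shape and names are assumptions.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

Event = Tuple[str, int]  # (key, amount): a hypothetical event shape


def apply_events(events: Iterable[Event], state: Optional[Counter] = None) -> Counter:
    """Fold events into per-key running totals, starting from an optional prior state."""
    state = Counter() if state is None else state
    for key, amount in events:
        state[key] += amount
    return state


# The retained event log plays the role of a replayable source.
log = [("checkout", 3), ("search", 1), ("checkout", 2), ("search", 4)]

# Normal processing: consume the whole log once.
live_state = apply_events(log)

# Recovery: restore a checkpoint taken after two events, then replay the rest.
checkpoint = apply_events(log[:2])
recovered_state = apply_events(log[2:], state=Counter(checkpoint))

print(live_state == recovered_state)
```

Because the processing function is deterministic, replaying the same events from a checkpoint reproduces the same state, which is the property that makes recovery after a failure possible.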
Data Warehouse, Lake
- Data Warehouse: structured, filtered data that has already been processed for a specific purpose
- Data Lake: raw data
- AWS Redshift
- Google BigQuery: serverless cloud data warehouse.
Data Routing and ETL
- Apache NiFi: the name comes from “NiagaraFiles”, the NSA project it originated from. Known for “ETL” and “Data Integration”. DAGs for data routing and ETL. Low code. Web GUI. “Platform”.
- Apache Camel: Java enterprise integration “Framework”.
- Example (Camel plus Kafka plus NiFi): a Java app uses Camel to send messages to Kafka, and NiFi consumes from Kafka.
- AWS Glue: serverless data integration.
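The producer, broker, consumer shape of the Camel plus Kafka plus NiFi example above can be sketched with an in-memory queue. This does not use the Camel, Kafka, or NiFi APIs; `queue.Queue` stands in for a Kafka topic, and the record fields and routing rule are hypothetical.

```python
import json
import queue

topic = queue.Queue()  # in-memory stand-in for a Kafka topic


def produce(orders: list, out: queue.Queue) -> None:
    """Producer side (the Camel role): serialize records and publish them."""
    for order in orders:
        out.put(json.dumps(order))


def consume_and_route(source: queue.Queue) -> dict:
    """Consumer side (the NiFi role): read records and route them by a field."""
    routes = {"priority": [], "standard": []}
    while True:
        try:
            raw = source.get_nowait()
        except queue.Empty:
            break  # topic drained
        record = json.loads(raw)
        # Hypothetical routing rule: large orders go to the priority route.
        target = "priority" if record["amount"] >= 100 else "standard"
        routes[target].append(record["id"])
    return routes


produce([{"id": 1, "amount": 250}, {"id": 2, "amount": 20}], topic)
routes = consume_and_route(topic)
print(routes)
```

The point of the middle layer is decoupling: the producer and consumer only agree on the topic and message format, not on each other's runtime.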
Workflow Management, Event Scheduling
- Apache Airflow
- Spring Batch: processes large volumes of records, with logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management.
- Luigi: tasks, data pipelines, batch jobs. Written by Spotify in Python.
- CloudWatch Events (now Amazon EventBridge): event scheduling, NOT generic CloudWatch.
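Workflow managers such as those listed above run tasks in dependency order. The sketch below shows that core idea with the standard library's `graphlib` (Python 3.9+); it is not the Airflow or Luigi API, and the task names and `run_pipeline` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},
    "load": {"clean", "enrich"},
}


def run_pipeline(deps: dict, actions: dict) -> list:
    """Run each task's action after all of its dependencies have run."""
    # static_order() raises CycleError if the graph is not a DAG.
    order = list(TopologicalSorter(deps).static_order())
    for task in order:
        actions[task]()
    return order


executed = []
actions = {name: (lambda n=name: executed.append(n)) for name in dependencies}
order = run_pipeline(dependencies, actions)
print(order)
```

Real schedulers add retries, scheduling, and parallel execution of independent tasks (here `clean` and `enrich`) on top of this ordering step.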