- SysOps practices and tooling overlap with DevOps.
- SRE (Site Reliability Engineering) is a subset of SysOps.
- Architecture Oversight: Coordinate with product managers and project managers.
- Defining and maintaining “static” infrastructure that cuts across applications and business units.
- Directory services (like LDAP, AD)
- Message buses (like Kafka)
- Long-lived data storage (DBs)
- Providing self-service tools for DevOps so they can cycle entire dev, staging, and prod resources safely.
- Performance Efficiency
- Cost optimization
- Disaster Prevention
- Fault Tolerance: the ability of a system to remain in operation even if some of its components fail.
- Resilience: ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.
- Automated backups, onsite and offsite
- Continuous Monitoring/Testing for:
- Business-wide security
- threat detection
- Reliability: distributed system design, recovery planning, and adapting to changing requirements
- Performance emergencies, playbooks
- Cost overrun emergencies, playbooks
- Business-wide security
- Zero Trust
- Data encryption in transit and at rest
- Disaster Recovery
- Backup and Restore: slow, cheap
- Pilot Light
- Warm Standby
- Multi-site active/active: fast, expensive
- Disaster Testing: Causing real disasters on purpose
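The resilience bullet above mentions mitigating transient network issues. One common way to do that is to retry the failing call with exponential backoff. The following is a minimal sketch, not a production implementation: `retry_with_backoff` and `FlakyService` are hypothetical names invented for illustration, and treating `ConnectionError` as "transient" is an assumption.

```python
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on a transient failure, wait and retry.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised so callers can escalate to a playbook.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:  # assumption: this is what "transient" looks like
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


class FlakyService:
    """Hypothetical dependency that fails a fixed number of times, then recovers."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("transient network issue")
        return "ok"
```

Passing `sleep` as a parameter lets a test record the delays instead of actually waiting, for example `retry_with_backoff(FlakyService(failures=2), sleep=delays.append)`.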
Continuous Automated Testing
- Continuous monitoring and Threat Detection
- AWS Security Hub
- Amazon GuardDuty
- Amazon Macie: detects PII exposure in S3.
Chaos Engineering: actually break things, without warning
Git Workflow for Ops Infrastructure
Self Serve Automated Infrastructure For Devs
File Systems, Storage
- Apache Spark: stream and batch processing. 3rd gen.
- Apache Flink: event-driven apps, stream and batch analytics, pipelines, ETL. Newer than Spark. 4th gen. Auto optimize. Many options for state maintenance. Supports replay. Known for “Big Data” and “Stream Processing”
- Amazon Kinesis: streams data into storage.
Data Warehouse, Lake
Data Routing and ETL
- Apache NiFi: name derives from “NiagaraFiles”, the NSA software it originated from. Known for “ETL” and “Data Integration”. DAGs for data routing and ETL. Low code. Web GUI. “Platform”.
- Apache Camel: Java enterprise integration “framework”.
- Example (Camel plus Kafka plus NiFi): a Java app uses Camel to send messages to Kafka; NiFi consumes from Kafka.
- AWS Glue: serverless data integration.
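The Camel, Kafka, and NiFi example above is a producer, a message bus, and a consumer in a row. The sketch below shows only that shape, using an in-memory `queue.Queue` as a stand-in for a Kafka topic. It does not use the Camel, Kafka, or NiFi APIs; `produce` and `consume` are hypothetical helpers invented for illustration.

```python
import queue


def produce(messages, topic):
    """Producer side (the role Camel plays above): put each message on the topic."""
    for message in messages:
        topic.put(message)


def consume(topic, transform):
    """Consumer side (the role NiFi plays above): drain the topic and transform each message."""
    results = []
    while not topic.empty():
        results.append(transform(topic.get()))
    return results


# Stand-in for a Kafka topic; a real broker would persist and partition messages.
orders_topic = queue.Queue()
produce(["order-1", "order-2"], orders_topic)
processed = consume(orders_topic, str.upper)
```

The point of the bus in the middle is decoupling: the producer and the consumer only share the topic, so either side can be replaced or scaled without changing the other.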
Workflow Management, Event Scheduling
- Apache Airflow
- Spring Batch: processes large volumes of records, with logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management.
- Luigi: tasks, data pipelines, batch jobs. Written by Spotify in Python.
- CloudWatch Events (now Amazon EventBridge)
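Workflow managers such as the ones listed above run tasks in dependency order. The core idea can be sketched with the Python standard library's `graphlib.TopologicalSorter` (Python 3.9+). This is an illustration of dependency ordering only, not the Airflow or Luigi API; `run_pipeline` and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, tasks):
    """Run each task after all of its prerequisites; return the execution order.

    dependencies maps a task name to the set of task names it depends on.
    tasks maps a task name to a zero-argument callable.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


log = []
tasks = {
    "extract": lambda: log.append("extracted"),
    "transform": lambda: log.append("transformed"),
    "load": lambda: log.append("loaded"),
}
# transform needs extract; load needs transform.
dependencies = {"transform": {"extract"}, "load": {"transform"}}
order = run_pipeline(dependencies, tasks)
```

Real workflow managers add what this sketch leaves out: schedules, retries, persisted task state, and restart from the point of failure.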