- A private VPC with subnets for workloads, ensuring no public IPs are
- Secures access to private workloads using Google-managed access policies.
- A management cluster with RBAC and GPU-optimized node pools.
- Centralized user and service account management.
- MySQL instance with private IP and integration with GKE.
- A GCS bucket and Filestore for durable and scalable storage.
- Decision: Use a private VPC with private subnets.
- Why: Private networking ensures that no workloads are exposed to the public internet, significantly reducing the attack surface.
- CIDR Block: Designed to allow future scaling the ammount of engineers working in the environment with room for additional subnets and integrations.
- Trade-Off: Setting up private access may require additional complexity in configuring Identity-Aware Proxy and private Google services.
- Improvement Opportunities:
- Use Shared VPC if there are multiple projects that require interconnectivity.
- Enable VPC Service Controls to protect against data exfiltration.
- Decision: Secure all workloads behind IAP.
- Why: Provides a centralized access control layer with fine-grained permissions.
- Trade-Off: Relies on Google-managed services, which may introduce operational overhead if custom authentication is required.
- Improvement Opportunities:
- Integrate IAP with an existing enterprise identity provider (e.g., Okta, Active Directory).
- Configure access logs and monitoring for compliance and troubleshooting.
- Decision: Deploy a single GKE cluster for managing other clusters.
- Why: Centralized control for cluster management simplifies scaling and ensures consistency.
- RBAC: Assign roles to service accounts and users for fine-grained access.
- GPU Node Pool: Uses
a2-highgpu-1goptimized for trading model training. - Trade-Off:
- Managing all clusters through one GKE cluster may introduce a single point of failure.
- GPU costs are higher; workloads should be optimized for GPU utilization.
- Improvement Opportunities:
- Enable Vertical Pod Autoscaler (VPA) for GPU nodes to optimize resource usage.
- Use Node Affinity to isolate workloads to specific node pools for security and efficiency.
- Decision: Manage users and service accounts via Terraform.
- Why: Centralized IAM management ensures consistent and auditable access control.
- Trade-Off:
- New IAM roles or permissions require Terraform changes, which may slow down immediate access adjustments.
- Improvement Opportunities:
- More granular control over permissions for each kind of user through code.
- Use IAM Conditions for time-based or attribute-based access policies.
- Integrate with Google Workspace Groups to simplify user role assignments.
- Decision: Use a MySQL instance without high availability.
- Why: Reduces cost while meeting current requirements.
- Private IP: Ensures that the database is accessible only within the VPC.
- Trade-Off:
- No HA means downtime in case of instance failure; manual failover or recovery would be required.
- Improvement Opportunities:
- Enable Point-in-Time Recovery for disaster recovery.
- Consider Cloud SQL Insights for monitoring and performance tuning.
- Decision: Use a GCS bucket for object storage and Filestore for shared file access.
- Why: Combines scalability with high availability.
- Trade-Off:
- Filestore is regional and requires manual replication for multi-region resilience.
- Improvement Opportunities:
- Use Bucket Lock for compliance requirements (WORM policy).
- Implement Backup Policies for Filestore.
- Networking:
- Use Firewall Rules to restrict access to specific subnets or IPs.
- Enable Private Google Access for API calls from the VPC.
- IAM:
- Implement Least Privilege for service accounts and users.
- Regularly rotate keys and monitor service account usage.
- Data Protection:
- Use Customer-Managed Encryption Keys (CMEK) for sensitive workloads.
- Enable Cloud DLP for identifying sensitive data in GCS.
- Monitoring and Auditing:
- Enable Cloud Logging and Cloud Monitoring for all resources.
- Use Security Command Center for proactive threat detection.
- Networking:
- Reserve additional CIDR blocks for future subnets and integrations.
- Use Cloud Interconnect or VPN for hybrid environments.
- GKE:
- Implement Horizontal Pod Autoscaler (HPA) for application workloads.
- Use Workload Identity instead of service account keys for secure authentication.
- CloudSQL:
- Plan for HA or a managed read-replica configuration if workloads grow.
- Storage:
- Transition to Coldline Storage for archived data to reduce costs.
The READMEPART2.md will:
- Explain Architecture: High-level overview of components and their purpose.
- Describe Module Usage: Briefly describe how to implement the code
- Provide Runbooks (TO DO):
- Steps for common operations (e.g., scaling GPU nodes, adding IAM users).
- Recovery procedures for CloudSQL and Filestore.
