AWS Architecture Best Practices
Concise, practical notes for designing AWS systems that are reliable, secure, and cost-efficient — aligned with AWS Solutions Architect Professional standards.
1. Network Architecture
1.1 Designing a VPC
Think of your VPC as a city — each subnet is a district with its own purpose.
Subnet Type | Purpose | Example Components |
---|---|---|
Public | Internet-facing resources | ALB, NAT Gateway, Bastion Host |
Private | Internal app and API layers | EC2, ECS Tasks |
Database | Isolated storage layer | RDS, Aurora |
Management | Monitoring and admin tools | Prometheus, Grafana |
Example layout:
/16 VPC (65,536 IPs)
├── /20 public subnets – across 3 AZs
├── /20 private subnets – across 3 AZs
├── /24 database subnets – across 3 AZs
└── /24 management subnets – across 3 AZs
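The layout above can be checked with a short script. This is a minimal sketch using Python's standard `ipaddress` module; the `10.0.0.0/16` range and the `plan_subnets` helper are illustrative assumptions, not part of the original notes.

```python
import ipaddress


def plan_subnets(vpc_cidr: str = "10.0.0.0/16", az_count: int = 3) -> dict:
    """Carve a /16 VPC into per-AZ subnets matching the example layout."""
    vpc = ipaddress.ip_network(vpc_cidr)
    # A /16 splits into sixteen /20 blocks of 4,096 addresses each.
    blocks = list(vpc.subnets(new_prefix=20))
    plan = {
        "public": blocks[:az_count],
        "private": blocks[az_count:2 * az_count],
    }
    # Split the next unused /20 into /24s for the smaller tiers.
    small = list(blocks[2 * az_count].subnets(new_prefix=24))
    plan["database"] = small[:az_count]
    plan["management"] = small[az_count:2 * az_count]
    return plan


if __name__ == "__main__":
    for tier, subnets in plan_subnets().items():
        print(tier, [str(s) for s in subnets])
```

With three AZs this uses seven of the sixteen /20 blocks, which leaves room to add tiers later without re-addressing.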
Design tips:
- Use at least two AZs (three preferred).
- Keep cross-AZ traffic low (reduces latency and cost).
- Use NAT Gateways for outbound internet access from private subnets (one per AZ avoids a cross-AZ dependency).
- Use VPC Endpoints to reach AWS services privately.
1.2 Network Security
Control | Scope | Behavior |
---|---|---|
Security Groups | Instance-level | Stateful, only “allow” rules. |
NACLs | Subnet-level | Stateless, supports allow and deny. |
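The difference in rule evaluation can be sketched in a few lines. This is a simplified model, not an AWS API: `NaclRule`, `nacl_allows`, and `security_group_allows` are hypothetical names, only inbound matching on source IP and port is modeled, and statefulness (return traffic) is left out.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int      # lower numbers are evaluated first
    cidr: str
    port_from: int
    port_to: int
    allow: bool


def _matches(cidr, port_from, port_to, source_ip, port):
    in_range = ipaddress.ip_address(source_ip) in ipaddress.ip_network(cidr)
    return in_range and port_from <= port <= port_to


def nacl_allows(rules, source_ip, port):
    """NACL: the first matching rule in number order decides; default is deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if _matches(rule.cidr, rule.port_from, rule.port_to, source_ip, port):
            return rule.allow
    return False


def security_group_allows(allow_rules, source_ip, port):
    """Security group: allow-only rules; any match permits the traffic."""
    return any(_matches(c, lo, hi, source_ip, port) for c, lo, hi in allow_rules)


# Example rule sets (documentation-range addresses, chosen for illustration).
NACL = [
    NaclRule(100, "203.0.113.0/24", 443, 443, allow=False),  # block one range
    NaclRule(200, "0.0.0.0/0", 443, 443, allow=True),        # allow everyone else
]
SG = [("0.0.0.0/0", 443, 443)]
```

Blocking a specific range needs the NACL, because a security group has no way to express a deny.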
Additional practices:
- Transit Gateway → central routing for multiple VPCs.
- PrivateLink / VPC Endpoints → avoid exposing services to the internet.
2. High Availability (HA)
2.1 Application HA
- ALB (L7) → host- and path-based routing, TLS termination, sticky sessions.
- NLB (L4) → static IPs, ultra-low latency.
- Auto Scaling → multi-AZ distribution, health checks, rolling updates.
2.2 Database HA
- RDS Multi-AZ → synchronous standby and failover.
- Read Replicas → async scaling for reads.
- RDS Proxy → efficient connection pooling.
- App-level retry and DNS failover logic recommended.
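App-level retry is commonly implemented as exponential backoff with jitter, so clients do not reconnect in lockstep during a failover. The sketch below is one possible shape; `retry_with_backoff` and its default values are assumptions for illustration, and the exception types to retry on depend on the database driver in use.

```python
import random
import time


def retry_with_backoff(operation, attempts=5, base_delay=0.2, max_delay=5.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a retryable error, wait and try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The `sleep` parameter exists so the waiting can be replaced in tests; in production the default `time.sleep` applies.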
3. Redundancy & Disaster Recovery
3.1 Storage Redundancy
- S3 → cross-region replication.
- EBS → snapshots and lifecycle policies.
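- EFS → Regional file systems store data redundantly across multiple AZs; use EFS replication for a copy in another Region.
- RDS → automated backups (retention configurable up to 35 days).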
3.2 DR Strategies
Strategy | Description | RTO/RPO |
---|---|---|
Backup & Restore | Rebuild infrastructure from backups | High |
Pilot Light | Minimal standby infrastructure | Medium |
Warm Standby | Scaled-down live copy | Low |
Multi-site Active | Full duplication across regions | Very Low (highest cost) |
4. Security Architecture
4.1 Identity & Access
- Use IAM roles, not static keys.
- Enforce MFA and least-privilege policies.
- Use AssumeRole for cross-account access.
- Leverage service-linked roles for AWS services.
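For cross-account access, the role in the target account carries a trust policy naming who may assume it. The sketch below builds such a policy document as a Python dict; the helper name and the placeholder account ID `111122223333` are assumptions, and the MFA condition is optional.

```python
import json


def cross_account_trust_policy(trusted_account_id: str, require_mfa: bool = True) -> dict:
    """Build a trust policy letting principals in another account assume the role."""
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if require_mfa:
        # Only allow the call when the caller authenticated with MFA.
        statement["Condition"] = {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
    return {"Version": "2012-10-17", "Statement": [statement]}


if __name__ == "__main__":
    print(json.dumps(cross_account_trust_policy("111122223333"), indent=2))
```

The trust policy only controls who can assume the role; what the role can do is set separately by its least-privilege permissions policy.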
4.2 Encryption
- At rest → S3, EBS, RDS, EFS encryption.
- In transit → TLS 1.2 or higher.
- KMS → centralized key management.
- ACM → automatic certificate management.
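On the client side, the TLS 1.2 floor can be enforced in application code as well as on the load balancer. A minimal sketch with Python's standard `ssl` module; the `strict_tls_context` name is an assumption.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Client context that verifies certificates and refuses anything below TLS 1.2."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The returned context can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=strict_tls_context())`.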
5. Performance Optimization
- Use Graviton instances or right-size with Compute Optimizer.
- Prefer gp3 over gp2 for EBS (baseline performance is independent of volume size).
- Apply S3 lifecycle rules → move cold data to Glacier.
- Reduce data transfer costs by staying in-region and private.
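The gp3 recommendation follows from how baseline IOPS are allocated. The sketch below compares the two, assuming the published figures at the time of writing (gp2: 3 IOPS per GiB, minimum 100, maximum 16,000; gp3: 3,000 baseline at any size). gp2 burst credits are ignored, and the figures should be checked against current AWS documentation.

```python
GP3_BASELINE_IOPS = 3000  # assumed gp3 baseline, independent of volume size


def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline scales with size: 3 IOPS/GiB, floor 100, ceiling 16,000."""
    return min(16000, max(100, 3 * size_gib))


def gp3_baseline_iops(size_gib: int) -> int:
    """gp3 baseline is flat regardless of size (more can be provisioned separately)."""
    return GP3_BASELINE_IOPS


if __name__ == "__main__":
    for size in (20, 100, 500, 1000, 6000):
        print(f"{size:>5} GiB  gp2={gp2_baseline_iops(size):>6}  gp3={gp3_baseline_iops(size):>6}")
```

Under these assumptions, gp2 only matches the gp3 baseline at 1,000 GiB, so smaller volumes get more baseline IOPS on gp3.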
6. Cost Optimization
- Continuously right-size using CloudWatch + Compute Optimizer.
- Mix On-Demand, Reserved, and Spot instances.
- Archive cold data with S3 Glacier.
- Set up Budgets, Cost Explorer, and tagging for governance.
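Archiving cold data is usually expressed as an S3 lifecycle rule. The sketch below builds a rule in the shape accepted by the S3 lifecycle configuration API; the `logs/` prefix, the 90-day threshold, and the helper name are hypothetical values chosen for illustration.

```python
import json


def glacier_lifecycle_rule(prefix: str, days_until_archive: int = 90) -> dict:
    """Lifecycle configuration that moves objects under a prefix to S3 Glacier."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/') or 'all'}",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": days_until_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(glacier_lifecycle_rule("logs/"), indent=2))
```

With boto3 installed, a dict of this shape would be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`.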
7. Monitoring & Observability
- CloudWatch → metrics, alarms, dashboards.
- Logs Insights → query logs across groups.
- X-Ray → distributed tracing.
- Synthetics → proactive canary checks.
8. Deployment & Operations
- Define infrastructure as code with CloudFormation or CDK.
- Automate pipelines via CodePipeline, CodeBuild, CodeDeploy.
- Deployment strategies:
  - Blue/Green → minimal downtime
  - Canary → gradual rollout
  - Rolling → phased updates
  - Immutable → brand-new instances
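As an illustration of the rolling strategy, the sketch below updates instances in fixed-size batches and stops at the first unhealthy batch. It is a simplified model rather than CodeDeploy's behavior; `deploy` and `is_healthy` are hypothetical callbacks supplied by the caller.

```python
def rolling_batches(instances, batch_size):
    """Split instances into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [instances[i:i + batch_size] for i in range(0, len(instances), batch_size)]


def rolling_update(instances, batch_size, deploy, is_healthy):
    """Deploy batch by batch; halt if any instance in a batch fails its check.

    Returns (instances confirmed healthy on the new version, success flag).
    """
    updated = []
    for batch in rolling_batches(instances, batch_size):
        for instance in batch:
            deploy(instance)
        if not all(is_healthy(instance) for instance in batch):
            return updated, False  # stop so the remaining instances keep serving
        updated.extend(batch)
    return updated, True
```

A smaller batch size limits the blast radius of a bad release at the cost of a longer rollout.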
9. Common Patterns
Pattern | AWS Services | Notes |
---|---|---|
Microservices | API Gateway + ECS/EKS + SQS/SNS | Async, scalable design |
Serverless | Lambda + API Gateway | Pay per use |
Event-Driven | SQS, SNS, EventBridge | Decoupled services |
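The decoupling in the event-driven row can be illustrated with an in-memory stand-in for the SNS-to-SQS fan-out: the publisher knows only the topic, and each consumer reads from its own queue. The `Topic` class below is a hypothetical teaching model, not an AWS client.

```python
from collections import deque


class Topic:
    """In-memory fan-out: every published message is copied to each subscriber queue."""

    def __init__(self):
        self._queues = {}

    def subscribe(self, name):
        """Register a subscriber and return the queue it should consume from."""
        queue = deque()
        self._queues[name] = queue
        return queue

    def publish(self, message):
        for queue in self._queues.values():
            queue.append(message)


if __name__ == "__main__":
    orders = Topic()
    billing = orders.subscribe("billing")
    shipping = orders.subscribe("shipping")
    orders.publish({"order_id": 1})
    # Each consumer drains its own queue at its own pace.
    print(billing.popleft(), shipping.popleft())
```

Adding a new consumer means adding a subscription; the publisher does not change.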
10. Design Principles (Rules of Thumb)
10.1 Availability Zones (AZs)
Nominal AZs = the number of AZs whose combined capacity must carry the full load, leaving one buffer AZ for fault tolerance.
Formula:
Nominal AZs = Total AZs - 1
Instances per AZ = Required Instances ÷ Nominal AZs (round up)
Example:
You’re in a region with 6 AZs and your app needs 5 EC2 instances.
Nominal AZs = 6 - 1 = 5
Instances per AZ = 5 ÷ 5 = 1 instance per AZ
Deploy 1 instance in each of the 6 AZs (6 in total). If one AZ fails, the 5 remaining AZs still provide the 5 required instances.
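The rule of thumb is easy to script for other region sizes. In this sketch, rounding up and the `surviving_capacity` figure are assumptions: the latter presumes the per-AZ count is deployed in every AZ, including the buffer AZ. `az_sizing` is a hypothetical helper name.

```python
import math


def az_sizing(total_azs: int, required_instances: int) -> dict:
    """Apply the nominal-AZ rule: size each AZ as if one AZ were already lost."""
    if total_azs < 2:
        raise ValueError("need at least 2 AZs to keep one as a buffer")
    nominal_azs = total_azs - 1
    # Round up so the nominal AZs together always cover the requirement.
    per_az = math.ceil(required_instances / nominal_azs)
    return {
        "nominal_azs": nominal_azs,
        "instances_per_az": per_az,
        # Assumption: per_az instances run in every AZ, so losing one AZ
        # leaves (total_azs - 1) * per_az instances.
        "surviving_capacity": per_az * (total_azs - 1),
    }


if __name__ == "__main__":
    print(az_sizing(6, 5))
    print(az_sizing(3, 4))
```

With fewer AZs the buffer costs proportionally more: 3 AZs and 4 required instances means 2 per AZ, or 6 running in total.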
10.2 Subnets per Tier
Subnets = (Number of Tiers) × (Number of AZs)
Example:
2 tiers (app + DB) × 3 AZs = 6 subnets
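The same count falls out of enumerating every tier and AZ pair, which is also a convenient way to generate subnet names. A small sketch; the naming scheme and AZ suffixes are assumptions.

```python
from itertools import product


def subnet_names(tiers, azs):
    """One subnet per (tier, AZ) pair, so the count is len(tiers) * len(azs)."""
    return [f"{tier}-{az}" for tier, az in product(tiers, azs)]


if __name__ == "__main__":
    names = subnet_names(["app", "db"], ["a", "b", "c"])
    print(len(names), names)
```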
10.3 Tiering Logic
- Traditional → 3-tier (Presentation / Logic / Data).
- In AWS → let requirements decide the number of tiers rather than defaulting to three.
- Private DB subnets → isolate for compliance and security.
- More subnets = better control, not automatically higher HA.
11. Best Practices Summary
Principle | Why It Matters |
---|---|
Design for failure | Expect AZ or instance failure — assume things will break and plan redundancy. |
Implement elasticity | Scale with demand using Auto Scaling and managed services. |
Automate with IaC | Use CloudFormation or CDK to reduce manual errors and enforce consistency. |
Monitor everything | Visibility ensures reliability — track metrics, logs, and alarms proactively. |
Optimize for cost | Efficiency drives sustainability — right-size, schedule, and review usage regularly. |
Common pitfalls:
- Running only in one AZ → single point of failure.
- Over-provisioning → wasted cost.
- Security added late → higher risk.
- No observability → blind troubleshooting.
- Tight coupling → poor scalability.