AWS Architecture Best Practices
Concise, practical notes for designing AWS systems that are reliable, secure, and cost-efficient — aligned with AWS Solutions Architect Professional standards.
1. Network Architecture
1.1 Designing a VPC
Think of your VPC as a city — each subnet is a district with its own purpose.
| Subnet Type | Purpose | Example Components |
|---|---|---|
| Public | Internet-facing resources | ALB, NAT Gateway, Bastion Host |
| Private | Internal app and API layers | EC2, ECS Tasks |
| Database | Isolated storage layer | RDS, Aurora |
| Management | Monitoring and admin tools | Prometheus, Grafana |
Example layout:
/16 VPC (65,536 IPs)
├── /20 public subnets – across 3 AZs
├── /20 private subnets – across 3 AZs
├── /24 database subnets – across 3 AZs
└── /24 management subnets – across 3 AZs
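As a rough sketch of the layout above, the following Python uses the standard `ipaddress` module to carve a /16 into the four tiers. The `10.0.0.0/16` range, the AZ suffixes, and the sequential allocation order are assumptions for illustration; the layout itself does not prescribe them.

```python
import ipaddress

# AWS reserves 5 addresses in every subnet (the first four and the last one).
AWS_RESERVED_PER_SUBNET = 5


def plan_vpc(cidr="10.0.0.0/16", azs=("a", "b", "c")):
    """Return {(tier, az): network} for a hypothetical 4-tier, multi-AZ VPC."""
    vpc = ipaddress.ip_network(cidr)
    twenties = list(vpc.subnets(new_prefix=20))  # sixteen /20 blocks in a /16
    n = len(azs)
    plan = {}
    for i, az in enumerate(azs):
        plan[("public", az)] = twenties[i]
        plan[("private", az)] = twenties[n + i]
    # Split the next unused /20 into /24 blocks for the two small tiers.
    small = list(twenties[2 * n].subnets(new_prefix=24))
    for i, az in enumerate(azs):
        plan[("database", az)] = small[i]
        plan[("management", az)] = small[n + i]
    return plan


def usable_ips(network):
    """Addresses left for workloads after the AWS-reserved ones."""
    return network.num_addresses - AWS_RESERVED_PER_SUBNET
```

Calling `plan_vpc()` yields twelve non-overlapping subnets and leaves most of the /16 unallocated for future growth.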
Design tips:
- Use at least two AZs (three preferred).
- Keep cross-AZ traffic low (reduces latency and cost).
- Use NAT Gateways for private subnet outbound access.
- Use VPC Endpoints to reach AWS services privately.
1.2 Network Security
| Control | Scope | Behavior |
|---|---|---|
| Security Groups | Instance-level (attached to the ENI) | Stateful, only “allow” rules. |
| NACLs | Subnet-level | Stateless, supports allow and deny. |
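The difference between the two controls in the table can be illustrated with a toy model. This is a simplified sketch, not how AWS implements filtering: it matches on port only (no protocol or CIDR), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int      # lower numbers are evaluated first
    port_from: int
    port_to: int
    allow: bool


def nacl_permits(rules, port):
    """First matching rule (by ascending number) decides; default is deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False


def sg_permits(allowed_ports, port):
    """Security groups hold allow rules only; anything unmatched is denied."""
    return port in allowed_ports


def nacl_round_trip(inbound, outbound, server_port, client_port):
    """Stateless: the reply to the client's ephemeral port needs its own rule."""
    return nacl_permits(inbound, server_port) and nacl_permits(outbound, client_port)


def sg_round_trip(inbound_ports, server_port):
    """Stateful: once the request is allowed in, the reply is allowed out."""
    return sg_permits(inbound_ports, server_port)
```

In this model a NACL that allows inbound 443 but has no outbound rule for ephemeral ports drops the reply, while a security group with the same inbound rule lets the whole exchange through.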
Additional practices:
- Transit Gateway → central routing for multiple VPCs.
- PrivateLink / VPC Endpoints → avoid exposing services to the internet.
2. High Availability (HA)
2.1 Application HA
- ALB (L7) → content-based routing, TLS termination, sticky sessions.
- NLB (L4) → static IPs, ultra-low latency.
- Auto Scaling → multi-AZ distribution, health checks, rolling updates.
2.2 Database HA
- RDS Multi-AZ → synchronous standby and failover.
- Read Replicas → async scaling for reads.
- RDS Proxy → efficient connection pooling.
- App-level retry and DNS failover logic recommended.
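One common way to implement the app-level retry mentioned above is exponential backoff with jitter. The sketch below is a minimal, generic version; the function name, default delays, and the choice of retryable exceptions are assumptions, and a real database driver would raise its own exception types.

```python
import random
import time


def retry(operation, attempts=5, base_delay=0.1, max_delay=5.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying transient failures with full-jitter backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Cap the exponential delay, then sleep a random fraction of it
            # so that many clients do not reconnect at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

During a Multi-AZ failover the endpoint's DNS record moves to the standby, so the operation passed in should open a fresh connection on each attempt rather than reuse a cached one.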
3. Redundancy & Disaster Recovery
3.1 Storage Redundancy
- S3 → cross-region replication.
- EBS → snapshots and lifecycle policies.
- EFS → Regional file systems store data redundantly across multiple AZs.
- RDS → automated backups (retention configurable up to 35 days; 0 disables them).
3.2 DR Strategies
| Strategy | Description | RTO/RPO |
|---|---|---|
| Backup & Restore | Rebuild infrastructure from backups | High |
| Pilot Light | Minimal standby infrastructure | Medium |
| Warm Standby | Scaled-down live copy | Low |
| Multi-site Active | Full duplication across regions | Very Low (highest cost) |
4. Security Architecture
4.1 Identity & Access
- Use IAM roles, not static keys.
- Enforce MFA and least-privilege policies.
- Use AssumeRole for cross-account access.
- Leverage service-linked roles for AWS services.
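For the cross-account case above, the role in the target account needs a trust policy that names who may assume it. The sketch below builds such a policy document as plain JSON; the helper name is hypothetical and the account ID is a placeholder, not a real account.

```python
import json


def cross_account_trust_policy(trusted_account_id, require_mfa=True):
    """Trust policy letting principals in another account call sts:AssumeRole."""
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if require_mfa:
        # Only allow the call when the caller authenticated with MFA.
        statement["Condition"] = {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
    return {"Version": "2012-10-17", "Statement": [statement]}


# Placeholder account ID for illustration only.
policy_json = json.dumps(cross_account_trust_policy("111122223333"), indent=2)
```

The trust policy only controls who can assume the role; what the role can then do is set by the separate permissions policies attached to it, which is where least privilege applies.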
4.2 Encryption
- At rest → S3, EBS, RDS, EFS encryption.
- In transit → TLS 1.2 or higher.
- KMS → centralized key management.
- ACM → automatic certificate management.
5. Performance Optimization
- Use Graviton instances or right-size with Compute Optimizer.
- Prefer gp3 over gp2 for EBS.
- Apply S3 lifecycle rules → move cold data to Glacier.
- Reduce data transfer costs by staying in-region and private.
6. Cost Optimization
- Continuously right-size using CloudWatch + Compute Optimizer.
- Mix On-Demand, Reserved, and Spot instances.
- Archive cold data with S3 Glacier.
- Set up Budgets, Cost Explorer, and tagging for governance.
7. Monitoring & Observability
- CloudWatch → metrics, alarms, dashboards.
- Logs Insights → query logs across groups.
- X-Ray → distributed tracing.
- Synthetics → proactive canary checks.
8. Deployment & Operations
- Define infrastructure as code with CloudFormation or CDK.
- Automate pipelines via CodePipeline, CodeBuild, CodeDeploy.
- Deployment strategies:
  - Blue/Green → minimal downtime
  - Canary → gradual rollout
  - Rolling → phased updates
  - Immutable → brand-new instances
9. Common Patterns
| Pattern | AWS Services | Notes |
|---|---|---|
| Microservices | API Gateway + ECS/EKS + SQS/SNS | Async, scalable design |
| Serverless | Lambda + API Gateway | Pay per use |
| Event-Driven | SQS, SNS, EventBridge | Decoupled services |
10. Design Principles (Rules of Thumb)
10.1 Availability Zones (AZs)
Nominal AZs = the number of AZs you expect to remain available after losing one; the extra AZ is the buffer for fault tolerance.
Formula:
Nominal AZs = Total AZs - 1
Instances per AZ = Required Instances ÷ Nominal AZs (round up)
Example:
You’re in a region with 6 AZs and your app needs 5 EC2 instances.
Nominal AZs = 6 - 1 = 5
Instances per AZ = 5 ÷ 5 = 1 instance per AZ
Deploy that count in every AZ, so 6 instances in total. If one AZ fails, the 5 remaining AZs still run the 5 instances the app needs.
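The rule of thumb above can be written as a small helper. This is a sketch under two assumptions: the per-AZ count is rounded up when the division is not exact, and the same count is deployed in every AZ including the buffer AZ. The function name and return keys are illustrative.

```python
import math


def az_plan(total_azs, required_instances):
    """Size a fleet so that losing any single AZ still leaves enough capacity."""
    if total_azs < 2:
        raise ValueError("need at least 2 AZs to tolerate an AZ failure")
    nominal_azs = total_azs - 1
    per_az = math.ceil(required_instances / nominal_azs)
    return {
        "nominal_azs": nominal_azs,
        "instances_per_az": per_az,
        "deployed": per_az * total_azs,           # running in normal operation
        "after_one_az_loss": per_az * nominal_azs,  # still running after a failure
    }
```

With fewer AZs the buffer costs more: 5 required instances in a 3-AZ region means 3 per AZ and 9 deployed, compared with 6 deployed in a 6-AZ region.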
10.2 Subnets per Tier
Subnets = (Number of Tiers) × (Number of AZs)
Example:
2 tiers (app + DB) × 3 AZs = 6 subnets
10.3 Tiering Logic
- Traditional → 3-tier (Presentation / Logic / Data).
- In AWS → be requirements-driven.
- Private DB subnets → isolate for compliance and security.
- More subnets = better control, not automatically higher HA.
11. Best Practices Summary
| Principle | Why It Matters |
|---|---|
| Design for failure | Expect AZ or instance failure — assume things will break and plan redundancy. |
| Implement elasticity | Scale with demand using Auto Scaling and managed services. |
| Automate with IaC | Use CloudFormation or CDK to reduce manual errors and enforce consistency. |
| Monitor everything | Visibility ensures reliability — track metrics, logs, and alarms proactively. |
| Optimize for cost | Efficiency drives sustainability — right-size, schedule, and review usage regularly. |
Common pitfalls:
- Running only in one AZ → single point of failure.
- Over-provisioning → wasted cost.
- Security added late → higher risk.
- No observability → blind troubleshooting.
- Tight coupling → poor scalability.
12. Migration & Modernization Notes
- Phases – Assess → Mobilize → Migrate & Modernize; gather business case, build landing zone, then execute wave plans.
- 6 Rs decision matrix – Rehost (lift/shift), Replatform (minimal tweaks), Refactor (cloud-native), Repurchase (SaaS), Retire, Retain. Match effort vs benefit.
- Discovery & planning – Application Discovery Service (ADS) inventories servers/dependencies feeding Migration Hub wave plans.
- Workload movement:
  - MGN (Application Migration Service) – block-level replication for lift/shift; handles cutover tests.
  - SMS (Server Migration Service) – legacy incremental replication for on-prem VMware/Hyper-V to EC2; AWS has discontinued it in favor of MGN.
  - DMS – homogeneous/heterogeneous database migrations plus CDC replication.
  - ECS Anywhere / EKS Anywhere – run AWS-managed containers on-prem to ease phased migrations.
- Governance – Control Tower builds multi-account landing zones with guardrails (note: it does not grant IAM access automatically).
13. Identity, Security & Governance Notes
- SCPs set maximum permissions; combine with IAM policies for effective rights. Remember that SCP evaluation is cumulative: Root SCPs apply to every OU/account, child OUs inherit parent SCPs, and explicit denies in any ancestor override everything beneath.
- Parameter Store vs Secrets Manager – config/simple secrets vs sensitive secrets with rotation/integration.
- IAM role vs instance profile – role = permission set; instance profile = container that attaches a role to EC2.
- Web identity federation – OIDC apps call `AssumeRoleWithWebIdentity` for short-lived creds; SAML federation uses `AssumeRoleWithSAML` instead.
- KMS – central envelope encryption (EBS/S3/RDS); import external keys when compliance demands.
- Macie / GuardDuty / Inspector / Security Hub – data classification, threat intel, vuln scanning, findings aggregation.
- AWS Config – drift/compliance detection; remediate via Lambda/SSM (Config itself is read-only).
- Service Catalog – standardized products with launch constraints/tags across accounts.
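The SCP bullet at the top of this list can be made concrete with a toy model of how effective permissions narrow down the OU tree. This is a deliberately simplified sketch: actions are plain strings with no wildcards, conditions and resource policies are ignored, and the function name is hypothetical.

```python
def effective_actions(identity_allows, scp_chain):
    """Actions a principal can actually use.

    identity_allows: set of actions granted by IAM policies.
    scp_chain: list of (allowed, denied) action sets, ordered root -> OU -> account.
    """
    allowed = set(identity_allows)
    for scp_allowed, scp_denied in scp_chain:
        allowed &= scp_allowed   # must be allowed at every level
        allowed -= scp_denied    # an explicit deny at any level always wins
    return allowed


root = ({"s3:GetObject", "ec2:RunInstances", "iam:CreateUser"}, set())
ou = ({"s3:GetObject", "ec2:RunInstances", "iam:CreateUser"}, {"iam:CreateUser"})
account = ({"s3:GetObject", "ec2:RunInstances"}, set())
identity = {"s3:GetObject", "iam:CreateUser", "rds:CreateDBInstance"}

result = effective_actions(identity, [root, ou, account])
```

Here `result` contains only `s3:GetObject`: the OU-level deny removes `iam:CreateUser`, and `rds:CreateDBInstance` is never allowed by any SCP, so the IAM grant alone is not enough.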
13.1 Identity Center & Directories
- IAM Identity Center (AWS SSO) – one identity source at a time: built-in directory, AWS Managed AD/AD Connector, or external IdP (SAML/SCIM).
- AD Connector for SSO – when using on-prem AD via connector you cannot simultaneously plug in a third-party IdP; pick one source.
- AWS Managed Microsoft AD trusts – support external and forest trusts (one/both directions) with on-prem AD or another managed AD.
- Granting management account access – invited accounts must create `OrganizationAccountAccessRole` themselves (accounts created through Organizations get it automatically) so the management account can assume it for break-glass admin.