Load balancing and horizontal scaling
gunicorn does not balance load evenly across its workers; see this PR and the issues mentioned.
This means that although a large number of requests is distributed more or less evenly in a statistical sense, the actual concurrency is far below the number of deployed workers. For example, with 4 pods and 4 workers per pod (16 workers in total), the actual concurrency may be much smaller than 16.
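To get a rough feel for the size of this gap, here is a minimal sketch under a simplifying assumption: each in-flight request lands on a worker chosen independently and uniformly at random. This is an idealized model, not a description of how gunicorn actually hands out connections, and the function name `expected_busy_workers` is made up for illustration.

```python
def expected_busy_workers(num_workers: int, num_requests: int) -> float:
    """Expected number of distinct workers that receive at least one request.

    Assumes every request picks a worker independently and uniformly at
    random (an idealized model, not gunicorn's real accept behaviour).
    """
    # A given worker is missed by one request with probability (1 - 1/w),
    # so it is missed by all n requests with probability (1 - 1/w) ** n.
    miss_probability = (1.0 - 1.0 / num_workers) ** num_requests
    return num_workers * (1.0 - miss_probability)


if __name__ == "__main__":
    # 16 workers, 16 simultaneous requests: roughly 10.30 busy workers
    print(f"{expected_busy_workers(16, 16):.2f}")
```

Under this model, 16 simultaneous requests keep only about 10 of the 16 workers busy on average; the rest of the requests wait behind another request on the same worker.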
The FastAPI documentation generally recommends running a single uvicorn process per pod and scaling at the cluster level. We tried this, and it showed better response times than the existing deployment.
The current ClusterIP service uses the default probability-based rule to pick a pod (not strict round robin). A LoadBalancer service may be useful here.
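The difference between a probability-based rule and strict round robin can be illustrated with a small simulation. This is only a sketch: it assumes the probability-based rule behaves like a uniform random pick per request, and all function names are hypothetical helpers, not part of any Kubernetes or nginx API.

```python
import itertools
import random
from collections import Counter


def peak_pod_load(picks: list[int]) -> int:
    """Largest number of requests that landed on any single pod."""
    return max(Counter(picks).values())


def random_picks(num_pods: int, num_requests: int, rng: random.Random) -> list[int]:
    """Assumed model of the probability-based rule: uniform random pod per request."""
    return [rng.randrange(num_pods) for _ in range(num_requests)]


def round_robin_picks(num_pods: int, num_requests: int) -> list[int]:
    """Strict round robin: pods are used in a fixed rotating order."""
    rotation = itertools.cycle(range(num_pods))
    return [next(rotation) for _ in range(num_requests)]


def mean_peak_random(num_pods: int, num_requests: int,
                     trials: int = 10_000, seed: int = 0) -> float:
    """Average peak per-pod load over many simulated bursts of random picks."""
    rng = random.Random(seed)
    total = sum(
        peak_pod_load(random_picks(num_pods, num_requests, rng))
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    print("round robin peak load:", peak_pod_load(round_robin_picks(4, 4)))
    print("random mean peak load:", mean_peak_random(4, 4))
```

With 4 pods and a burst of 4 requests, round robin puts exactly one request on each pod, while random selection puts two or more on the same pod in most bursts. With one uvicorn process per pod, those requests then compete for a single process.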
Or maybe use the load balancer from nginx?