7 Questions Developers Should Answer for Smooth SRE Coordination


Good coordination between SRE and development is essential for smooth operations, especially in distributed environments.

To make this easier, we have boiled it down to seven questions developers should answer for the SRE team about any new service they build:

1. How can I check the health of the service? This includes being able to ping endpoints securely and periodically to ensure the service is running smoothly.

2. How can I safely and gracefully restart the service? This includes ensuring that graceful shutdowns wait for inflight requests to finish and that there will be no disruption or performance degradation when the service is restarted.

3. How and why would the service fail? This includes understanding its external dependencies and what happens when they fail, and having a playbook or sequence of steps to bring the service back up. Ideally, recovery is fully automated AND there is documentation for when the automation fails.

4. Are you using appropriate logging levels and formatting for your logs? This includes ensuring that TRACE and DEBUG levels are not enabled in production and that logs are written to stdout in a consistent format (such as JSON or plain text).

5. What kind of metrics are you exposing for the service? This should include the RED signals: request rate, errors, and request duration.

6. Are there documentation and design specifications for the service? This includes having an API contract (such as an OpenAPI or Swagger specification) and understanding how data flows through the service, including any potential PII or other sensitive data.

7. What is the test coverage for this service? This includes having unit and integration tests, plus an end-to-end test that can be run to identify issues.

This template should be reviewed periodically, and developers should be encouraged to check the production logs of their own services as well (as long as no direct user data is logged, or it is anonymized).
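To make questions 1, 2, 4, and 5 more concrete, here is a minimal sketch of what the answers could look like in a small Python service that uses only the standard library. Everything specific in it is an assumption made for illustration and does not come from any particular platform: the `/healthz` and `/metrics` paths, the port, the `GracefulServer` and `RedMetrics` names, and the JSON field names. A real service would more likely use its framework's own health, metrics, and logging facilities.

```python
import json
import logging
import signal
import sys
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class JsonFormatter(logging.Formatter):
    """Question 4: one JSON object per log line, in a consistent shape."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
        })


def make_logger(name="service", level=logging.INFO):
    # INFO by default, so DEBUG output stays off in production.
    handler = logging.StreamHandler(sys.stdout)  # logs go to stdout
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(level)
    logger.propagate = False
    return logger


class RedMetrics:
    """Question 5: raw counters; a scraper derives rate and mean duration."""

    def __init__(self):
        self._lock = threading.Lock()
        self.requests = 0
        self.errors = 0
        self.duration_seconds_sum = 0.0

    def observe(self, duration, error):
        with self._lock:
            self.requests += 1
            self.errors += int(error)
            self.duration_seconds_sum += duration

    def snapshot(self):
        with self._lock:
            return {"requests": self.requests, "errors": self.errors,
                    "duration_seconds_sum": self.duration_seconds_sum}


LOG = make_logger()
METRICS = RedMetrics()


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        start = time.perf_counter()
        if self.path == "/healthz":      # Question 1: hypothetical health path
            status, body = 200, {"status": "ok"}
        elif self.path == "/metrics":    # hypothetical metrics path
            status, body = 200, METRICS.snapshot()
        else:
            status, body = 404, {"error": "not found"}
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
        METRICS.observe(time.perf_counter() - start, error=status >= 500)

    def log_message(self, format, *args):
        LOG.info(format % args)  # route access logs through the JSON logger


class GracefulServer(ThreadingHTTPServer):
    # Non-daemon handler threads are joined by server_close(), so
    # in-flight requests get to finish before the process exits.
    daemon_threads = False


def start_server(host="127.0.0.1", port=8080):
    server = GracefulServer((host, port), Handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread


def stop_server(server, thread):
    """Question 2: stop accepting, then wait for in-flight requests."""
    server.shutdown()      # returns once serve_forever() has exited
    server.server_close()  # closes the socket and joins handler threads
    thread.join()


def main():
    server, thread = start_server()
    stop = threading.Event()
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, lambda *_: stop.set())
    stop.wait()            # block until a termination signal arrives
    stop_server(server, thread)
```

Calling `main()` from the service's entry point starts the server and blocks until SIGTERM or SIGINT arrives. The server loop runs in a background thread because `shutdown()` must be called from a different thread than the one running `serve_forever()`. The sketch leaves out authentication for the health endpoint, dependency checks, and a shutdown timeout, all of which a production service would likely need.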

Codesphere (codesphere.com) brings more velocity and accountability to the whole engineering org by connecting everyone on the same platform.