The Third-Party Trap: How to Monitor the APIs You Don't Control

Cloves Engineering
9 min read

Your real SLA is dictated by the weakest API in your dependency chain. If you rely on five third-party services with 99.9% uptime, your mathematical maximum uptime is actually 99.5%.
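The arithmetic behind that ceiling is simple: assuming the failures are independent, availabilities multiply. A quick check in Python:

```python
# Five independent dependencies, each at 99.9% availability.
# Best-case composite availability is the product, not the minimum.
max_uptime = 0.999 ** 5
print(f"{max_uptime:.4%}")  # ~99.50%
```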

Introduction

Modern software development is largely an exercise in assembly. Instead of building everything from scratch, we stitch together specialized SaaS products: Stripe for payments, Twilio for SMS, SendGrid for emails, Algolia for search, and AWS S3 for storage.

This architecture allows small teams to build massively complex applications in record time. However, it introduces a severe operational vulnerability: you are accountable for the reliability of systems you do not own.

When your payment gateway starts dropping packets, or your transactional email provider experiences a 30-second latency spike, your internal infrastructure dashboards will look perfectly healthy. Your CPU is low, your memory is stable, and your internal network is humming. Yet, your users are staring at hanging loading spinners and failing transactions.

If your observability strategy only looks inward at your own servers, you are completely blind to the external dependencies that actually dictate your user experience.

What You Will Learn

  • Why relying on vendor status pages is a reactive, dangerous operational strategy.
  • The mechanics of Thread Starvation caused by third-party API degradation.
  • How to implement Egress Monitoring to catch vendor outages before they report them.
  • Practical implementation of the Circuit Breaker Pattern to prevent third-party failures from cascading into your own infrastructure.

Deep Dive

Why Vendor Status Pages Lie (By Omission)

When a critical workflow fails, the instinct of most engineering teams is to check the vendor's status page (e.g., status.stripe.com or status.aws.amazon.com). Usually, the page is a sea of green checkboxes.

There are three reasons why vendor status pages are unreliable during the first 30 minutes of an incident:

  1. Human Intervention: Most major status pages are not fully automated. They require an incident commander to manually flip the switch to "Degraded." This process often requires internal consensus and can take 15 to 45 minutes from the start of the actual failure.
  2. Global Aggregation: A vendor might have 99.99% global success rates, but if the specific regional edge node you are routed to (e.g., us-east-2) is failing, you are experiencing a 100% localized outage that will never register on their global dashboard.
  3. The "Soft Outage" (Latency): Vendors rarely report latency spikes as outages. If an API that usually takes 200ms suddenly takes 9 seconds, the vendor still considers it a "Successful 200 OK." But to your application, a 9-second delay is a hard timeout.

The Anatomy of Thread Starvation

A third-party failure doesn't just break the specific feature it powers; if left unchecked, it will crash your entire application. This happens through a process called Thread Starvation (or Connection Pool Exhaustion).

Imagine your backend is written in Node.js, Python, or Java, and configured to handle 1,000 concurrent requests.

  1. A user attempts to check out. Your server opens an HTTP connection to the Payment API.
  2. The Payment API is degraded and simply hangs, neither accepting nor rejecting the payload.
  3. Your server's request sits open, waiting for a response. This ties up one of your 1,000 available connection threads.
  4. As more users try to check out, more threads are tied up waiting on the dead third-party API.
  5. Within seconds, all 1,000 threads are locked in a "waiting" state.
  6. Now, when a user requests your homepage (which requires zero third-party APIs), your server cannot respond because it has no free threads to process the request.

A degraded external payment gateway just took down your entire website.
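The simplest first defense is a hard client-side timeout on every outbound call, so a hung vendor releases the worker thread instead of holding it indefinitely. A minimal sketch using Python's `requests` library (the endpoint URL and timeout values are illustrative, not a real integration):

```python
import requests

PAYMENT_API = "https://payments.example.com/v1/charge"  # hypothetical endpoint

def charge(payload: dict) -> dict:
    try:
        # timeout=(connect, read): give up in ~3 seconds total instead of
        # hanging a worker thread for the server's default (often unbounded).
        resp = requests.post(PAYMENT_API, json=payload, timeout=(1.0, 2.0))
        resp.raise_for_status()
        return resp.json()
    except requests.Timeout:
        # The thread is freed immediately; surface a retryable error instead.
        return {"error": "payment_gateway_timeout", "retryable": True}
```

With a 3-second ceiling, a degraded vendor can still hurt the checkout feature, but it can no longer pin all 1,000 threads indefinitely and take the homepage down with it.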

Implementing Egress Monitoring

To protect your system, you must monitor your external dependencies as rigorously as you monitor your internal microservices. This is called Egress Monitoring or Third-Party Synthetic Testing.

Instead of waiting for users to fail, your observability platform should actively ping your critical third-party endpoints from your own infrastructure's perspective.

Here is an example of an egress monitor configuration in Clovos, designed to verify the health of an external SMS provider API:

```yaml
monitor_id: "egress_twilio_sms_api"
type: "api_synthetic"
endpoint: "https://api.twilio.com/2010-04-01/Accounts/${{ secrets.TWILIO_SID }}/Messages.json"
method: "POST"
interval_seconds: 30
request:
  headers:
    Authorization: "Basic ${{ secrets.TWILIO_AUTH_B64 }}"
  body:
    To: "+15550000000"    # Test number
    From: "+15550000001"
    Body: "Synthetic Egress Check"
assertions:
  - type: status_code
    # 400 is expected because we are using test credentials intentionally.
    # If we get a 5xx or a timeout, the API is broken.
    value: 400
  - type: latency_total
    operator: less_than
    value: 800ms
```

By running this check every 30 seconds, your team will be alerted to a third-party latency spike or failure immediately, long before the vendor updates their official status page.
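If you want to prototype the same check without a monitoring platform, the core logic is only a few lines. A rough sketch in Python (the expected status code and latency threshold mirror the config above; the probe URL is an assumption):

```python
import time
import requests

def egress_check(url: str, expected_status: int = 400,
                 max_latency_s: float = 0.8) -> dict:
    """Probe an external endpoint; assert on status code and total latency."""
    start = time.monotonic()
    try:
        resp = requests.post(url, timeout=max_latency_s)
        latency = time.monotonic() - start
        healthy = resp.status_code == expected_status and latency <= max_latency_s
        return {"healthy": healthy, "status": resp.status_code,
                "latency_s": round(latency, 3)}
    except requests.RequestException:
        # Timeouts and connection errors both count as a failed egress check.
        return {"healthy": False, "status": None,
                "latency_s": round(time.monotonic() - start, 3)}
```

Run it on a 30-second schedule (cron, a sidecar, or your existing job runner) and alert on consecutive unhealthy results.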

The Circuit Breaker Pattern

Monitoring tells you when a dependency is broken, but you need a defensive architecture to automatically mitigate the damage. This is where the Circuit Breaker pattern comes in.

A circuit breaker wraps your outbound API calls in a state machine:

  1. Closed (Healthy): Traffic flows normally to the third-party API. The breaker monitors the failure rate and latency.
  2. Open (Failing): If the third-party API exceeds a failure threshold (e.g., 50% of requests fail or take longer than 2 seconds), the circuit "opens." All subsequent calls to this API are immediately aborted locally. Your server does not even attempt to connect to the vendor. It instantly returns a fallback response or a localized error to the user. This prevents Thread Starvation.
  3. Half-Open (Testing): After a cooldown period (e.g., 30 seconds), the breaker allows a single test request through. If it succeeds, the circuit closes (recovers). If it fails, the circuit opens again.
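The three-state machine above can be sketched in a few dozen lines. A simplified, single-threaded illustration in Python, using a consecutive-failure count rather than a failure-rate window (production code would use a battle-tested library such as Resilience4j for Java or `pybreaker` for Python):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.cooldown_s = cooldown_s                # how long to stay open
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.state = "half_open"  # allow one test request through
            else:
                return fallback()  # fail fast: no network attempt at all
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        self.state = "closed"  # a success in half_open closes the circuit
        return result
```

A success in the half-open state closes the circuit; a failure reopens it and restarts the cooldown, exactly as described in steps 2 and 3.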

Here is an architectural example of how you might configure a circuit breaker (using a tool like Envoy Proxy or an in-code library like Resilience4j):

```json
{
  "circuit_breaker_name": "stripe_payment_gateway",
  "target_url": "api.stripe.com",
  "rules": {
    "error_threshold_percentage": 50,
    "timeout_ms": 1500,
    "volume_threshold": 10,
    "sleep_window_ms": 30000
  },
  "fallback_action": {
    "type": "return_local_response",
    "status_code": 503,
    "body": {
      "error": "Payment processing is temporarily degraded. Please try again in a few minutes."
    }
  }
}
```

By failing fast locally (within milliseconds) rather than waiting 10 seconds for a broken vendor to respond, your application remains fast, your connection pool remains clear, and the rest of your platform stays online.

Conclusion

You cannot control the uptime of the third-party services you rely on, but you are absolutely responsible for how your application behaves when they fail.

Assuming your vendors will always be fast and available is an architectural flaw. By implementing proactive egress monitoring to detect vendor degradation instantly, and wrapping those dependencies in strict circuit breakers, you transform a potentially catastrophic system crash into a localized, gracefully handled degradation.

Take the next step: Audit your critical user paths (checkout, signup, login) and list every external API call involved. Configure a synthetic egress monitor for each of those vendor endpoints today, and ensure your application has a strict, short timeout configured for every outbound HTTP request.
