feat(cache): Add a Django cache adapter that reconnects #104816
Conversation
We often see load imbalances in our memcache clusters. Application containers connect through a pool of twemproxy instances, and because application containers are long-lived and only connect on startup, we can end up with imbalanced load. By periodically disconnecting and reconnecting to twemproxy we have a better chance of getting an even load distribution. Refs PRODENG-702
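The approach can be sketched as a thin adapter that rebuilds its backend once the connection is older than a threshold. This is a minimal sketch, not the PR's actual code; `ReconnectingAdapter`, `FakeBackend`, and the `reconnect_age` default are illustrative names.

```python
import time


class FakeBackend:
    """Stand-in for a memcache client; tracks whether it has been closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class ReconnectingAdapter:
    """Recreate the underlying backend once it is older than reconnect_age seconds."""

    def __init__(self, reconnect_age=300.0):
        self._reconnect_age = reconnect_age
        self._backend = None
        self._last_reconnect_at = 0.0

    def _get_backend(self):
        age = time.time() - self._last_reconnect_at
        if self._backend is None or age >= self._reconnect_age:
            if self._backend is not None:
                # Close the stale connection before building a fresh one.
                self._backend.close()
            self._backend = FakeBackend()
            self._last_reconnect_at = time.time()
        return self._backend


adapter = ReconnectingAdapter(reconnect_age=0.01)
first = adapter._get_backend()
time.sleep(0.02)
second = adapter._get_backend()
assert first is not second
assert first.closed
```

Because each container reconnects on its own schedule, connections drift across the twemproxy pool over time instead of staying pinned to whichever proxies were healthy at startup.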
```python
if age >= self._reconnect_age:
    # Close the underlying cache connection if we haven't done that recently.
    metrics.incr("cache.memcache.reconnect")
    self._backend.close()

return self._backend
```
This comment was marked as outdated.
```python
def get(self, key, default=None, version=None):
    return self._get_backend().get(key, default=default, version=version)
```
This comment was marked as outdated.
mwarkentin
left a comment
LGTM, not sure about the cursor comments.
If those aren't a concern, let's merge and proceed with some testing.
I'll give that a shot.
Using a subclass means we can override an internal factory method and be less concerned with the public interface of cache adapters.
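The shape of that idea reads roughly like this. The backend internals are simulated with a stand-in class here, and `_cache` as the memoizing factory property is an assumption about the adapter, not the PR's verbatim structure.

```python
import time


class StandInBackendCache:
    """Stand-in for a Django memcached backend whose client is built lazily."""

    @property
    def _cache(self):
        # Internal factory: builds and memoizes the client object.
        if not hasattr(self, "_client"):
            self._client = object()
        return self._client

    def get(self, key, default=None):
        # Public interface; the subclass below never has to touch it.
        return default


class ReconnectingCache(StandInBackendCache):
    """Override only the internal factory to add reconnect behavior."""

    reconnect_age = 300.0

    @property
    def _cache(self):
        now = time.time()
        if now - getattr(self, "_built_at", 0.0) >= self.reconnect_age:
            # Drop the memoized client so the parent factory rebuilds it.
            if hasattr(self, "_client"):
                del self._client
            self._built_at = now
        return super()._cache


cache = ReconnectingCache()
assert cache.get("missing", default=42) == 42
```

Only the factory is overridden, so any future changes to the public cache API (`get`, `set`, `delete`, and friends) are inherited for free.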
Force-pushed from cdeb757 to 47a0f6a (Compare)
There is a possibility of multiple threads attempting to reconnect to memcache concurrently. Use a lock to avoid races.
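Serializing the rebuild with a lock looks roughly like the sketch below. This is the general `threading.Lock` pattern under assumed names, not the PR's exact code.

```python
import threading
import time


class LockedAdapter:
    def __init__(self, reconnect_age=300.0):
        self._reconnect_age = reconnect_age
        self._backend = None
        self._last_reconnect_at = 0.0
        self._backend_lock = threading.Lock()

    def _get_backend(self):
        age = time.time() - self._last_reconnect_at
        if self._backend is None or age >= self._reconnect_age:
            with self._backend_lock:
                # Only one thread tears down and rebuilds at a time;
                # a real adapter would close() the old backend here.
                self._backend = object()
                self._last_reconnect_at = time.time()
        return self._backend


adapter = LockedAdapter()
assert adapter._get_backend() is adapter._get_backend()
```

The lock guarantees the rebuild itself is not interleaved; whether the staleness check also needs to be repeated under the lock is a separate question.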
```python
self._backend = None

metrics.incr("cache.memcache.reconnect")
self._backend = self._class(self.client_servers, **self._options)
self._last_reconnect_at = time.time()
self._backend_lock.release()
```
This comment was marked as outdated.
I think it's reasonable to use a finally here, in the odd case that an exception ends up thrown
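The suggestion in isolation: release in a `finally` so that an exception during the rebuild cannot leave the lock held. The rebuild step is faked here.

```python
import threading

lock = threading.Lock()


def rebuild_backend(fail=False):
    lock.acquire()
    try:
        if fail:
            # Simulates the backend constructor raising.
            raise RuntimeError("backend constructor blew up")
        return object()
    finally:
        # Runs on both the normal and the exception path.
        lock.release()


rebuild_backend()
try:
    rebuild_backend(fail=True)
except RuntimeError:
    pass
assert not lock.locked()
```

A `with lock:` block is the equivalent idiom and releases on exceptions the same way.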
```python
self._backend = None

metrics.incr("cache.memcache.reconnect")
self._backend = self._class(self.client_servers, **self._options)
self._last_reconnect_at = time.time()
self._backend_lock.release()

return self._backend
```
This comment was marked as outdated.
```python
self._backend.close()
self._backend = None

metrics.incr("cache.memcache.reconnect")
```
Should this stat only fire inside the `if self._backend:` branch? Otherwise it's counting the first connection as well.
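The distinction the comment draws can be shown with a toy stand-in for `metrics.incr`: bump the counter only when an existing backend is being replaced.

```python
counts = {}


def incr(name):
    # Stand-in for metrics.incr.
    counts[name] = counts.get(name, 0) + 1


def build_backend(old):
    if old is not None:
        # Only a genuine reconnect bumps the counter;
        # the first connect stays silent.
        incr("cache.memcache.reconnect")
    return object()


backend = build_backend(None)      # initial connect: no metric
backend = build_backend(backend)   # real reconnect: metric fires
assert counts.get("cache.memcache.reconnect", 0) == 1
```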
```python
reconnect = False
age = time.time() - self._last_reconnect_at
if not self._backend or age >= self._reconnect_age:
    reconnect = True
```
Nit: I think calling this `reconnect` is a little misleading, because it also covers the first connect. Although I don't have a great suggestion for a better name.
In adding the try/finally I've removed this variable. I agree it wasn't a great name.
```python
metrics.incr("cache.memcache.reconnect")

self._backend = self._class(self.client_servers, **self._options)
self._last_reconnect_at = time.time()
```
Missing re-check of age causes unnecessary backend reconnections
Medium Severity
The `age` variable is computed at line 36 before acquiring the lock, but after the lock is acquired, the code doesn't re-check whether `_last_reconnect_at` was updated by another thread. The `if self._backend:` check at line 41 only prevents closing a `None` backend, not a freshly created one. When multiple threads hit the reconnect threshold simultaneously, the second thread will close the backend that the first thread just created, causing unnecessary connection churn.
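The fix this finding implies is the classic double-checked pattern: recompute staleness after acquiring the lock, so a thread that lost the race sees the fresh backend and leaves it alone. A sketch under assumed names:

```python
import threading
import time


class Adapter:
    def __init__(self, reconnect_age=300.0):
        self._reconnect_age = reconnect_age
        self._backend = None
        self._last_reconnect_at = 0.0
        self._lock = threading.Lock()

    def _stale(self):
        age = time.time() - self._last_reconnect_at
        return self._backend is None or age >= self._reconnect_age

    def _get_backend(self):
        if self._stale():
            with self._lock:
                # Re-check under the lock: another thread may have
                # reconnected between our staleness check and acquiring it.
                if self._stale():
                    self._backend = object()
                    self._last_reconnect_at = time.time()
        return self._backend


adapter = Adapter()
assert adapter._get_backend() is adapter._get_backend()
```

With the re-check, concurrent threads that all see a stale backend serialize on the lock, and only the first one actually rebuilds; the rest fall through to the fresh instance.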
It doesn't look like we have memcache available in CI right now, and writing tests with only mocks would yield tautological tests.
Refs PRODENG-702