Thanks for this!
I adapted your performance script to do a quick-and-dirty check: swapping the ctx for the lru cache and running two tests, loop(write+read) and write+loop(read). A sketch of what I ran is below.
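This is a minimal sketch of the comparison, not the exact script: it assumes it runs in a `content_by_lua_block` and that nginx.conf declares a shared dict via `lua_shared_dict my_cache 10m;` (the names `my_cache`, `bench`, and the iteration count `N` are mine).

```lua
local lrucache = require "resty.lrucache"

local N = 100000  -- iteration count, pick whatever gives stable numbers

-- crude wall-clock timer; ngx.now() is cached, so force a refresh first
local function bench(name, fn)
    ngx.update_time()
    local start = ngx.now()
    fn()
    ngx.update_time()
    ngx.say(name, ": ", ngx.now() - start, "s")
end

local lru = assert(lrucache.new(N))
local dict = ngx.shared.my_cache

-- test 1: loop(write+read), interleaved in a single loop
bench("lru  write+read", function()
    for i = 1, N do
        local key = "k" .. i
        lru:set(key, i)
        local _ = lru:get(key)
    end
end)

bench("dict write+read", function()
    for i = 1, N do
        local key = "k" .. i
        dict:set(key, i)
        local _ = dict:get(key)
    end
end)

-- test 2: write+loop(read), all writes first, then all reads
bench("lru  write, then read", function()
    for i = 1, N do lru:set("k" .. i, i) end
    for i = 1, N do local _ = lru:get("k" .. i) end
end)

bench("dict write, then read", function()
    for i = 1, N do dict:set("k" .. i, i) end
    for i = 1, N do local _ = dict:get("k" .. i) end
end)
```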
On both operations the overhead is small... but the per-worker LRU cache is faster than the shared dict by a factor of around 10x.
This sounds in line with what I remember reading earlier. I needed to pull up some actual stats though, and this quick-and-dirty test suite does that.
We have some code in production that stores SSL certificates as cdata in the lru cache and fails over to PEM data stored in the shared dict (and, failing that, falls back to fetching the data from upstream); roughly the pattern sketched below. A friend is dealing with some caching performance issues during high-traffic periods and is using the shared dict. I thought enabling the LRU cache might help a bit.
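For illustration, here is the shape of that two-tier pattern, not our actual production code. It assumes a shared dict `certs` holding PEM strings and made-up names (`cert_cache`, `get_cert`, `cert_key`); the point is that the cdata returned by `ssl.parse_pem_cert()` can live in the per-worker lrucache but cannot go into a shared dict, which only stores strings, numbers, booleans, and nil.

```lua
local ssl = require "ngx.ssl"
local lrucache = require "resty.lrucache"

-- module-level so the cache persists across requests within a worker
local cert_cache = assert(lrucache.new(100))

local function get_cert(cert_key)
    -- fast path: parsed cdata already cached in this worker
    local cert = cert_cache:get(cert_key)
    if cert then
        return cert
    end

    -- fail over to the shared dict, which holds the raw PEM text
    local pem = ngx.shared.certs:get(cert_key)
    if not pem then
        return nil, "cert not found"  -- caller would fetch from upstream here
    end

    -- parse the PEM into cdata and cache it for this worker
    local err
    cert, err = ssl.parse_pem_cert(pem)
    if not cert then
        return nil, err
    end

    cert_cache:set(cert_key, cert)
    return cert
end
```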