Saturday, September 29, 2007

Hard disk caching, is it that useful?

I was sitting with a couple of friends, discussing performance issues in scalable systems, mail systems in particular, when all of a sudden one of us said, "It would be better to eliminate all system caching; then we will have better performance." Wow, where did this come from?

Eliminating the cache to boost performance? We all know that there are several system caches and buffers that store the most recently fetched items, so that whenever someone asks for them again the system doesn't need to go back to the disk to fetch them. It's like a smaller, faster intermediate storage holding the most recently accessed blocks.

Many algorithms and theories govern the best way to cache I/O: what to cache, how much to cache, locality and other cache issues. They all work toward decreasing the time it takes to read data, so how could removing our friends here possibly enhance performance?
In our case it's a mail system with about 3 million users, meaning 3 million mailboxes, and the read/write activity, especially the I/O, is all on the mailboxes. As a matter of fact, what gets cached here is the user's mailbox, so how often would two users ask for the same mailbox? The cache is almost unusable: every time a user asks for his mailbox, it is fetched from the hard disk and cached, then another user does the same, until the cache is full and we need to evict blocks (users' mailboxes) to free space. If we are talking about thousands of visits, the chance that a user revisits his mailbox and still finds it in the cache is very low. Those cached copies in turn impose the CPU and memory cost of moving the data from the cache to the userspace destination buffer for reads, and the other way around for writes.
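
To put a rough number on that intuition, here is a minimal simulation sketch in Python. It is not a model of any real mail server: it assumes an LRU cache that holds whole mailboxes and users who show up uniformly at random, and the function name `lru_hit_rate` and its parameters are made up for this illustration.

```python
import random
from collections import OrderedDict

def lru_hit_rate(num_mailboxes, cache_slots, num_requests, seed=0):
    """Simulate an LRU cache of whole mailboxes and return the hit rate.

    Assumption (not from a real system): every request picks a mailbox
    uniformly at random, i.e. no user is more active than any other.
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # mailbox id -> placeholder, oldest entry first
    hits = 0
    for _ in range(num_requests):
        box = rng.randrange(num_mailboxes)
        if box in cache:
            hits += 1
            cache.move_to_end(box)         # mark as most recently used
        else:
            cache[box] = True              # "fetch from disk" and cache it
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict the least recently used
    return hits / num_requests

if __name__ == "__main__":
    # 3 million mailboxes, room for 30,000 of them in the cache
    print(lru_hit_rate(3_000_000, 30_000, 200_000))
```

Under these assumptions the hit rate ends up close to cache_slots / num_mailboxes, so roughly 1% for the numbers above. Real users are not uniform, so treat this as a lower-bound thought experiment, not a measurement.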

At this point I started to think of a more radical idea: why not tweak the hard disk cache too? I mean the hardware cache on the hard disk itself. Or use cheap hard disks with no internal cache. Let's make it a cacheless system, with direct access to the data and no unnecessary overhead, no intermediate zone for any reason. It's a cache-free world!
But then, what would the impact be? How could we use caching in such systems? What are the best practices? What if all of this is just wrong speculation, and caching does make a difference even in such systems? I think we need to run some tests and benchmarks to validate what we are saying here. Maybe I will work on this one day.
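
A first benchmark could look something like the Python sketch below. It is only a starting point under several assumptions: it is Unix-only (it relies on `os.posix_fadvise` and `os.pread`), the helper name `time_random_reads` is invented here, and `POSIX_FADV_DONTNEED` is just a hint that the kernel may or may not honor, so the "no cache" run is an approximation. The measured time for that run also includes the cost of the hint call itself.

```python
import os
import random
import time

def time_random_reads(path, block_size=4096, num_reads=1000,
                      drop_cache=False, seed=0):
    """Return the seconds spent on num_reads random block reads from path.

    With drop_cache=True, the kernel is asked (advisory only) to forget
    each block's cached pages right before the block is read.
    """
    rng = random.Random(seed)
    blocks = max(1, os.path.getsize(path) // block_size)
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(num_reads):
            offset = rng.randrange(blocks) * block_size
            if drop_cache:
                os.posix_fadvise(fd, offset, block_size,
                                 os.POSIX_FADV_DONTNEED)
            os.pread(fd, block_size, offset)
        return time.perf_counter() - start
    finally:
        os.close(fd)
```

Running it twice on the same large file, once with `drop_cache=False` and once with `drop_cache=True`, gives a crude first comparison. A serious test would use real mailbox access traces and a data set much larger than RAM.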
