Saturday, September 29, 2007

Linux kernel scheduler vs. Windows kernel scheduler

I have been conducting systems administration interviews for a while now, and I used to ask one question in every interview: which is better, “Linux” or “Windows”? I used to settle for a simple answer like “It depends on the environment”, and that answer could get the guy onto our payroll on the spot. That was in the old days, when I was still young and foolish. In those days I used to forgive my Windows when it hung seemingly forever, doing something I knew nothing about, or when my server bailed out on me for no good reason and without a trace. But you know, people do grow up.

For a long time now I have been playing with operating systems, including Linux and Windows. After being a loyal follower of Microsoft technologies, I had a paradigm shift. I saw the beauty of Linux, and I got in touch with what an operating system really means. Day after day I started to understand how Linux outperforms Windows. Take this for example: why does Windows sometimes stop responding to your requests and start playing busy? Your hard disk LED starts blinking, and no matter how much you click, your computer never gives you any attention. In this article we will try to explain why this happens on Windows and rarely on Linux.

Let's imagine your process getting into the OS and praying that it reaches the CPU before it starves. According to this article, the Windows kernel scheduler has two queues: a foreground queue served with a round-robin algorithm and a background queue served first in, first out. The scheduler uses priorities along with other algorithms to decide which queue your poor process goes into. The problem here is that the Windows scheduler works with a multilevel queue technique, meaning that once you are in a queue you are stuck there until your turn comes to get the CPU, or you starve to death. It's simple, but crude too. So what happens when many background processes get into the background queue? These processes are not time-sliced like the ones in the foreground queue, so once they get in, they don't get out until they finish. And to make things even better, the scheduler chooses between the two queues with a probability of 80% for the background queue to 20% for the foreground queue. So if the odds are against you, which always seems to be the case with me, your process has to wait a long time until it gets served, and all you get is the frozen screen and the busy hard disk light.

So what does Linux do? The Linux scheduler is a bit smarter: it uses a technique called multilevel queue with feedback. Hmmmm.. feedback gives the impression that the process is able to discuss its state with the scheduler, and not, like on Windows, accept its fate and be doomed to never-never land. Yes, Linux has more than two queues with different algorithms and priorities, and a process can move from one queue to another according to its state. So if a process has been stuck in a queue for a long time without being served, its priority increases and it gets moved to another, VIP queue where processes get served at once.
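To make the feedback idea concrete, here is a minimal sketch in Python. It is a toy model and not the actual Linux scheduler: the function name, the queue levels, the quantum sizes and the `boost_after` threshold are all assumptions I picked for illustration. A process that uses up its whole time slice is moved down one level, and a process that has waited too long in a lower queue is moved back up to the top queue.

```python
from collections import deque


def mlfq_schedule(jobs, quanta=(2, 4, 8), boost_after=20):
    """Run jobs (name -> CPU ticks needed) through a toy multilevel feedback queue.

    Returns the executed slices as (name, queue_level, ticks) in order.
    All parameters are illustrative assumptions, not real kernel values.
    """
    queues = [deque() for _ in quanta]      # level 0 is the "VIP" queue
    remaining = dict(jobs)
    waiting_since = {name: 0 for name in jobs}
    for name in jobs:
        queues[0].append(name)              # everybody starts at the top
    clock = 0
    trace = []
    while any(queues):
        # Feedback upwards: anything that waited too long is promoted.
        for level in range(1, len(queues)):
            for name in list(queues[level]):
                if clock - waiting_since[name] >= boost_after:
                    queues[level].remove(name)
                    queues[0].append(name)
                    waiting_since[name] = clock
        # Serve the highest non-empty queue.
        level = next(i for i, q in enumerate(queues) if q)
        name = queues[level].popleft()
        run = min(quanta[level], remaining[name])
        clock += run
        remaining[name] -= run
        trace.append((name, level, run))
        if remaining[name] > 0:
            # Feedback downwards: it used its whole slice, so demote it.
            lower = min(level + 1, len(queues) - 1)
            queues[lower].append(name)
            waiting_since[name] = clock
    return trace
```

Calling `mlfq_schedule({"short": 1, "long": 10})` lets the short job finish first, while the long one drifts down through the queues and gets longer slices at each level.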

Some say all this complexity in the Linux scheduler will create overhead and slow things down when you have a large number of processes. But Linux uses an O(1) scheduling algorithm (as does Windows, to give it credit), which means the time it takes to pick the next process does not depend on the number of processes. It's smart, it really respects your requests, and it doesn't give you the sense that the computer is doing something much more important than your pathetic request.
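Here is a rough sketch of how picking the next process can take constant time: keep one queue per priority level plus a bitmap with one bit per non-empty level, so finding the best level is a single "find first set bit" step no matter how many processes are waiting. The class and method names are hypothetical. The 140 levels mirror the priority range of the Linux 2.6 O(1) scheduler, and everything else is simplified.

```python
from collections import deque


class PriorityArray:
    """Toy run queue: one FIFO per priority plus a bitmap of non-empty levels."""

    def __init__(self, levels=140):
        self.queues = [deque() for _ in range(levels)]
        self.bitmap = 0                     # bit p is set when queues[p] is non-empty

    def enqueue(self, task, prio):
        self.queues[prio].append(task)
        self.bitmap |= 1 << prio

    def pick_next(self):
        """Return the next task of the best (lowest-numbered) priority, or None."""
        if self.bitmap == 0:
            return None
        # Isolate the lowest set bit and turn it into an index.
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        queue = self.queues[prio]
        task = queue.popleft()
        if not queue:
            self.bitmap &= ~(1 << prio)     # level is empty again
        return task
```

The point of the design is that `pick_next` never loops over the tasks themselves, only over a fixed-size bitmap.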

So where do you want to go today? I know where I am going.

Hard disk caching, is it that useful?

I was sitting with a couple of friends of mine, and we were discussing scalable system performance issues, specifically mail system performance, when all of a sudden one of us said, "It would be better to eliminate all system caching, then we will have better performance". Wow.. where did this come from?

Eliminating cache to boost performance? We all know that there are several system caches and buffers that store the most recently fetched items, so that whenever someone asks for them again the system doesn't need to go and fetch them again. It's like an intermediate, faster and smaller storage holding the most recently viewed blocks..

Many algorithms and theories govern the best way to cache I/O: what to cache, how much to cache, locality and other cache issues. They all work for the sake of decreasing the time to read data, so how come removing our friends here might enhance performance?
In our case it's a mail system with about 3 million users, meaning 3 million mailboxes, and the read/write activity, especially the I/O, is all on the mailboxes. As a matter of fact, what is cached here is the user's mailbox, so how often could two users ask for the same mailbox? The cache is almost unusable, because every time a user asks for his mailbox it is fetched from the HD and cached, and then another user does the same, until the cache is full and we need to evict blocks (users' mailboxes) to free space. If we are talking about thousands of visits, the probability that a user revisits his mailbox and still finds it in the cache is very low. Those cached copies in turn impose the CPU and memory cost of moving the data from the cache to the userspace destination buffer for reads, and the other way around for writes.
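A quick way to get a feel for this argument is to simulate it. The sketch below is a scaled-down toy, not a benchmark: the function name, the LRU eviction policy and the assumption that every mailbox is equally likely to be requested are all mine, and real mail traffic is certainly not uniform. It simply counts how often a request finds its mailbox still in an LRU cache.

```python
import random
from collections import OrderedDict


def lru_hit_ratio(num_mailboxes, cache_slots, requests, seed=0):
    """Fraction of requests served from an LRU cache of mailboxes.

    Assumes uniformly random access, which is an illustrative simplification.
    """
    rng = random.Random(seed)
    cache = OrderedDict()                   # oldest entry first
    hits = 0
    for _ in range(requests):
        box = rng.randrange(num_mailboxes)
        if box in cache:
            hits += 1
            cache.move_to_end(box)          # mark as most recently used
        else:
            cache[box] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)   # evict the least recently used
    return hits / requests
```

Under these assumptions the hit ratio ends up close to `cache_slots / num_mailboxes`, so a cache that holds 1% of the mailboxes serves roughly 1% of the requests. With a more skewed access pattern the picture could look quite different, which is exactly why this needs real measurements.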

At this moment I started to think of a more radical idea: why not tweak the H.D. cache too? I mean the hardware cache on the hard disk itself, or even use cheapo hard disks with no internal caches. Let's make it a cacheless system, with direct access to data without any unnecessary overhead and no intermediate zone for any reason. It's a cache-free world!
But then, what will be the impact of this? How could we use caching in such systems? What are the best practices? What if all of this is just wrong speculation, and cache does make a difference even in such systems? I think we need to run some tests and benchmarks to validate what we are saying here. Maybe I will work on this one day.

MySQL FALCON storage engine.

I have been getting more and more into MySQL for the last couple of months, and to tell you the truth I was really impressed. I can't say I am a database guru, but I have had my experience with database engines before, and to my surprise.. MySQL does match up with the big players in this sector.

I can still remember MySQL from the old days, when it used to be thought of as a light, fast database engine that could only be used for personal websites. But now MySQL has everything an enterprise DB engine would need.... and more.

What really dazzled me about this engine is its layered, pluggable storage engine architecture. This DB engine was designed to separate the storage engine from the rest of the system, so you can plug in any engine from a vast list, or even write your own!!

One of the new players in MySQL's storage lineup is FALCON. This one is said to be the killer of Oracle's InnoDB, or is it?

Falcon (code name) is a transactional storage engine based on the Netfrastructure database engine, extended and integrated into MySQL.

The main goals of Falcon are to exploit large memory for more than just a bigger cache, and to use threads and processors for data migration. Falcon has a larger row cache with age-group scavenging. Falcon is multi-version in memory and single-version on disk. True Multi Version Concurrency Control (MVCC) enables records and tables to be updated without the overhead associated with row-level locking mechanisms; the MVCC implementation virtually eliminates the need to lock tables or rows during the update process. Data and index caching also provide quick access to data without the requirement to load index data from disk.
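For readers who have not met MVCC before, here is a minimal, generic sketch of the idea in Python. It is not Falcon's implementation, and the class and method names are hypothetical: every write appends a new version stamped with a logical commit time, and a reader only sees versions that existed when it took its snapshot, so readers and writers do not need to lock each other out.

```python
class MVCCStore:
    """Toy multi-version store: readers see the data as of their snapshot."""

    def __init__(self):
        self.versions = {}                  # key -> list of (commit_ts, value)
        self.clock = 0                      # logical commit counter

    def write(self, key, value):
        """Append a new version instead of overwriting the old one."""
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        """A reader remembers the commit counter at the moment it starts."""
        return self.clock

    def read(self, key, snapshot_ts):
        """Return the newest version committed at or before the snapshot."""
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None
```

A reader that took its snapshot before a later `write` keeps seeing the old value, while a new reader sees the new one. The price is that old versions have to be kept around and cleaned up at some point.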

So it seems that it has everything to be the ace of all storage engines, no? Think again. A benchmark was made to compare the performance of this new engine with the old MyISAM and InnoDB here, but regretfully the benchmarks are not in favor of Falcon.. InnoDB and MyISAM scored better than Falcon on different queries.

From what I read, I can say that the Multi Version Concurrency Control implemented in the new engine is a drawback in some cases rather than a performance boost. Keeping multiple snapshots for every session gives better locking behaviour, but at the same time the engine needs to access the row data in addition to the key, and as we know keys are used a lot for optimization. This comes on top of the overhead of the mechanism itself.

We can also see that it has a bug with queries that contain LIMIT: performance drops drastically when it is used.

The question here is, will Jim Starkey put his house in order and make Falcon the number one storage engine as promised? Either way we are the winners, as we have other alternatives, and as they say.. competition is the consumer's number one friend.

MySQL Network Monitor Adviser

We had this installed on one of our database servers earlier today. I had the feeling that this would be JAMT (Just Another Monitoring Tool), telling you CPU utilization, memory usage.. and maybe some extra readings on cache hit ratio and running queries. What else could it be? Nothing more than some queries that could be done with a few Bash/Perl scripts and shown in a nice AJAX web interface. Nothing I can't do.. or so I thought.

The installation went smoother than I expected. The idea that it installs Apache, MySQL, Tomcat and Java beside the instances already running on the server worried me a bit, as I feared they might conflict with the running production instances. But again I was wrong: the installer detected the running services and installed itself somewhere else on different ports.. hmm, smart.

The installation was done by just running a bin file for the server and another for the agent, and the setup walks you through an interactive installation.. Until then I was not that impressed. It was just a clean installation script, something expected from MySQL.

The server ran smoothly and so did the agent, and of course they communicated without any problems or interference from my side. And then the show began.

At the beginning it was everything I expected, some monitoring scripts shown on a flashy web interface, until I saw this icon telling me I have a problem..

Table Scan and Query cache.. Interesting.
Once I clicked on the Query cache I got this message:

"Evaluate whether the query cache is suitable for your application. If you have a high rate of INSERT / UPDATE / DELETE statements compared to SELECT statements, then there may be little benefit to enabling the query cache. Also check whether there is a high value of Qcache_lowmem_prunes, and if so consider increasing the query_cache_size."

It's not JAMT, it's an adviser: it detects poor performance and configuration, and walks you through the steps to analyze and fix your problems.
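To show the kind of check behind such advice, here is a small sketch in Python. It is my own guess at the logic, not the adviser's actual rule: the function name and all three thresholds are assumptions. The counter names are real MySQL status variables as reported by SHOW GLOBAL STATUS, and the sketch relies on the fact that SELECTs answered from the query cache are counted in Qcache_hits rather than Com_select.

```python
def query_cache_advice(status, min_hit_ratio=0.2, max_write_ratio=1.0,
                       max_prunes=100):
    """Return {code: message} hints from a dict of MySQL status counters.

    Thresholds are illustrative assumptions, not values used by any MySQL tool.
    """
    hits = int(status.get("Qcache_hits", 0))
    selects = int(status.get("Com_select", 0))
    writes = sum(int(status.get(name, 0))
                 for name in ("Com_insert", "Com_update", "Com_delete"))
    prunes = int(status.get("Qcache_lowmem_prunes", 0))

    total_selects = hits + selects          # cache hits are not in Com_select
    advice = {}
    if total_selects:
        if hits / total_selects < min_hit_ratio:
            advice["low_hit_ratio"] = "few SELECTs are served from the query cache"
        if writes / total_selects > max_write_ratio:
            advice["write_heavy"] = "writes outnumber SELECTs, cache may not pay off"
    if prunes > max_prunes:
        advice["raise_query_cache_size"] = "many low-memory prunes, consider a larger query_cache_size"
    return advice
```

You would feed it the name/value pairs from SHOW GLOBAL STATUS. The values may arrive as strings, which is why everything goes through `int()`.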

MySQL is impressing me day after day. I don't know if Oracle has such a tool, but I know SQL Server doesn't!

What I see is that MySQL keeps the client in mind, this poor DB admin who just sits day after day trying to resolve performance issues and strange application activity, and then MySQL uses technology to serve him well and make his life easier. On the other hand, the other players in this market sector target the business owners, trying to impress them with eye-catching slogans like "High performance", "High availability", "Load balancing".. MySQL knows better: it can't do this without its real clients, the DB admins and the sysadmins, so it keeps things simple and works with them to make MySQL a better place for data.

I am really interested to know how MySQL is going to astonish me next time. I am waiting for this, and I think it will be soon.

MogileFS revisited

So I got this reply on my recent post:

"Please recall that MogileFS has no POSIX file API. All file transfers
are done via HTTP. So, it really isn't a drop-in replacement for NFS
or any other network file system. You need to add logic to your
application to deal with MogileFS.

Also, you can't do updates to a file; you must overwrite the entire
file if you make any changes.

MogileFS is primarily intended for a write-once/read-many setup."

So how would this fit in our system? For starters, I think it won't have much impact, since we are storing system images. Not being able to update files won't be an issue, as images tend to be very large, and once an image is stored it is either replaced by a newer image or used to restore a system. Also, we are going to use Ruby on Rails to interface with the system Imager, our open source imaging system, and Ruby has a plugin for MogileFS, so it won't be a problem to integrate it, and everything seems OK.

What about other systems, how could MogileFS be useful there? Would these issues be a problem for applications in need of smart storage? Let's take a mail system for example: we have multiple servers serving a domain, and users' mailboxes are spread among these servers. The files in this case will be the emails, and since emails never need updating, the write-once/read-many condition is fulfilled. However, if the mail service is not tailor-made or customized, it will be hard to integrate MogileFS, meaning that if you are using a ready-made mail server like Sendmail or Qmail, you will find it difficult to make MogileFS your storage engine.

As a conclusion, MogileFS is best used with applications that are developed with MogileFS in mind as their storage engine. You can use it with out-of-the-box systems, but it won't be a smooth ride, and for sure there are some systems that will not benefit from MogileFS, like file sharing or workflow systems.
Still, I can't wait to try it out, and I will keep you updated.

MogileFS Storage engine!

I came across this today, and it seemed interesting.. MogileFS is intended for storage-hungry applications. It's all about spreading your files across cheap devices on different hosts, something like RAID + NFS + data replication.

The idea is very nice and simple: you have multiple servers, and every server has multiple devices. You sum up all these storage units into one big storage pool, and you have a tracker application that you consult when reading from or writing to this huge storage. The tracker takes responsibility for saving your data and making sure that it stays available even if multiple hosts go offline.
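To picture what the tracker does, here is a toy sketch in Python. It does not use the MogileFS API or its real placement logic. The class, its methods and the simple rotating placement are all hypothetical, and only the bookkeeping is modelled, not the actual file transfer. The tracker remembers which hosts hold a copy of each file and only hands out hosts that are still alive.

```python
class ToyTracker:
    """Toy tracker: remembers which hosts hold a copy of each key."""

    def __init__(self, hosts, copies=2):
        self.hosts = list(hosts)
        self.copies = copies                # how many hosts get each file
        self.alive = set(self.hosts)
        self.locations = {}                 # key -> list of hosts with a copy
        self._next = 0                      # rotates the starting host

    def store(self, key):
        """Pick `copies` distinct live hosts for a new file."""
        candidates = [h for h in self.hosts if h in self.alive]
        if len(candidates) < self.copies:
            raise RuntimeError("not enough live hosts for the requested copies")
        start = self._next % len(candidates)
        chosen = [candidates[(start + i) % len(candidates)]
                  for i in range(self.copies)]
        self._next += 1
        self.locations[key] = chosen
        return chosen

    def mark_down(self, host):
        self.alive.discard(host)

    def locate(self, key):
        """Return the live hosts that still have the file."""
        return [h for h in self.locations.get(key, []) if h in self.alive]
```

With two copies per file, losing one host still leaves `locate` with somewhere to send the reader, which is the whole point of the design.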

This application came just in time. We had just had an idea for a project that takes images of your server and stores them on network storage, so if something goes wrong with your server you can simply take the image and restore it to your server, or you can even restore the image on a different server to clone it, or something like that. The challenge was where to store all of these images. By a simple calculation, if you have 100 users and every user has a 10 GB image, then you are bound to maintain about a terabyte of storage.. and scalability will be an issue.
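The calculation is trivial, but it is worth writing down because replication multiplies it. A small sketch, where the function names, the `copies` parameter and the per-server capacity are assumptions for illustration:

```python
import math


def storage_needed_gb(users, image_gb, copies=1):
    """Raw storage needed when every user's image is kept `copies` times."""
    return users * image_gb * copies


def servers_needed(total_gb, usable_gb_per_server):
    """How many servers of a given usable capacity cover the total."""
    return math.ceil(total_gb / usable_gb_per_server)
```

So `storage_needed_gb(100, 10)` gives 1000 GB, about a terabyte, and keeping two copies of every image doubles that to 2000 GB.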

With MogileFS you gain three advantages here. One, you will have cheap disks on cheap servers with your storage distributed across them. Two, you can benefit from this distribution by installing the application on all of these servers, and so gain high availability. Three, scaling will be as simple as adding a server to the farm. So for about half the price of a SAN and its expensive disks, you get high availability for both your storage and your application. Of course we will have to manage this distributed environment. One way to tackle it is to create a no-slave architecture: all servers are masters, and every server can find out which server stores the user’s image by consulting the tracker. So when a user logs in, he first lands on any server according to a round-robin algorithm, and from this server he is redirected to the server storing his image, where he gets served without the network communication overhead.
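The login flow described here can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are hypothetical, and a plain dictionary stands in for the tracker lookup.

```python
from itertools import cycle


def make_frontend(servers, image_location):
    """Build a login handler: round-robin first hop, then redirect.

    `image_location` maps user -> server holding that user's image and
    stands in for asking the tracker.
    """
    rotation = cycle(servers)               # endless round-robin over servers

    def handle_login(user):
        first_hop = next(rotation)          # any server, chosen round-robin
        owner = image_location[user]        # where the image actually lives
        redirected = first_hop != owner
        return first_hop, owner, redirected

    return handle_login
```

Each call returns the server the user landed on, the server he ends up on, and whether a redirect was needed. After the redirect the image is read locally, which is where the saving on network traffic comes from.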

This architecture can be implemented with any storage intensive application, or any application that used to rely on NFS, as NFS has proven to be unreliable in heavy production environments.

I like this tool very much, and I can’t wait to test it on our application.. so I will keep you posted with any updates.