On O_DIRECT

January 20, 2013

Recently a colleague of mine was telling me how good O_DIRECT was, because it was fast and because it guaranteed that your data was actually written to disk without the need to call fsync (or fdatasync) explicitly. I had never used O_DIRECT and didn't know anything about it, except that it required the memory buffers to be aligned on a particular boundary, so I decided to fill this hole in my knowledge and educate myself. I started by just googling it, and the first thing I encountered was this: http://www.kernel.org/doc/man-pages/online/pages/man2/open.2.html. Holy cow! That man page had a special section devoted entirely to O_DIRECT. Unfortunately it didn't give a very good description of what O_DIRECT actually does, and it didn't state anything with certainty. I found the quote from Linus Torvalds particularly disturbing. What was this feature that my colleague said was really good, that the man page failed to explain clearly, and that Linus called complete rubbish?! So I decided to investigate further. I downloaded the newest version of the Linux kernel (3.7.2 stable) and started poking about, and here is what I found:

(NOTE: the next paragraph assumes that you are at least a bit familiar with the Linux kernel and the block device layer.)
O_DIRECT is definitely a big deal. It requires specific callback functions to be set by the file system. For example, in the ext2 implementation (source under linux-source-tree/fs/ext2/inode.c) the function is called ext2_direct_IO and is set as part of ext2_aops. The same pattern can be seen in ext3 (source under linux-source-tree/fs/ext3/inode.c, where the function is called ext3_direct_IO). One thing I noticed is how simple the implementations of these functions are. The ext2 implementation (obviously the simplest) just calls blockdev_direct_IO. The implementation of blockdev_direct_IO can be found under linux-source-tree/include/linux/fs.h. It calls __blockdev_direct_IO, which calls do_blockdev_direct_IO in linux-source-tree/fs/direct-io.c. do_blockdev_direct_IO makes some sanity checks, then creates and initializes a dio structure (the first few lines of the function body), iterates through all the sectors that need to be read/written (line 1185 to be precise), schedules a bio (aka a read/write request) towards the block device (using do_direct_IO on line 1206), and finally waits for all the scheduled bios to complete (from line 1266 downwards).

Hmm… direct-io.c is quite lengthy; it does a lot of stuff just to support O_DIRECT. Obviously the FS caches are completely bypassed, but any caches the block device driver or the physical hardware might hold are not, because neither the REQ_FLUSH nor the REQ_FUA flag is set on the bio structures created and issued by direct-io.c. This means that you can't count on your data actually being written to disk, and you must call fsync anyway.
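
To make that conclusion concrete, here is a minimal Python sketch of what a careful O_DIRECT write looks like: the file is opened with O_DIRECT, one zero-padded block is written from a page-aligned buffer, and fsync is still called before the file is closed. It assumes Linux and a 4096-byte block size. The names durable_direct_write and _write_block are my own, and the fallback to buffered I/O when the file system rejects O_DIRECT with EINVAL is an illustrative choice, not something taken from the kernel sources.

```python
import errno
import mmap
import os


def _write_block(path, buf, extra_flags):
    """Write the whole buffer to path, then fsync before closing."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags
    fd = os.open(path, flags, 0o644)
    try:
        written = os.write(fd, buf)
        if written != len(buf):
            raise OSError(errno.EIO, "short write")
        # O_DIRECT only bypasses the page cache; fsync is what asks the
        # kernel to flush the data (and device caches) to stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_direct_write(path, payload, block_size=4096):
    """Write payload, zero-padded to whole blocks, using O_DIRECT + fsync.

    Returns True if O_DIRECT was used, False if buffered I/O was used
    because O_DIRECT is unavailable or rejected with EINVAL.
    block_size=4096 is an assumption, not a universal constant.
    """
    # Round the length up to a whole number of blocks (at least one).
    length = max(1, -(-len(payload) // block_size)) * block_size
    # Anonymous mmap memory is page-aligned and zero-filled.
    buf = mmap.mmap(-1, length)
    try:
        buf[:len(payload)] = payload
        direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
        if direct:
            try:
                _write_block(path, buf, direct)
                return True
            except OSError as exc:
                if exc.errno != errno.EINVAL:
                    raise
        _write_block(path, buf, 0)
        return False
    finally:
        buf.close()
```

Calling durable_direct_write("/some/path", b"hello") leaves a 4096-byte file on disk. Note that the fsync call is the same in both code paths, which is exactly the point: O_DIRECT does not replace it.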

On top of that, the buffers used when working with files opened with O_DIRECT must be aligned on a specific boundary (usually 4k). With all these limitations, constraints and lack of certainty, I started wondering who is using O_DIRECT anyway? It sounds like a lot of trouble for not much gain. So I set out to investigate that too. Linus has stated that O_DIRECT is used by some databases, so I looked at a few common and popular ones:
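
As a side note, the alignment constraint is easy to get wrong, so here is a small Python sketch of the bookkeeping involved. It assumes a 4096-byte boundary (the real requirement depends on the device and the file system), and the helper names round_up, is_aligned and aligned_buffer are hypothetical. It relies on the fact that anonymous mmap memory starts on a page boundary, and uses ctypes only to inspect the buffer's address.

```python
import ctypes
import mmap

ALIGNMENT = 4096  # assumed boundary; check your device and file system


def round_up(n, alignment=ALIGNMENT):
    """Round n up to the next multiple of alignment."""
    return -(-n // alignment) * alignment


def is_aligned(value, alignment=ALIGNMENT):
    """True if value (an address, offset or length) is a multiple of alignment."""
    return value % alignment == 0


def aligned_buffer(size, alignment=ALIGNMENT):
    """Return (buffer, address) for a page-aligned buffer of at least size bytes.

    The buffer is an anonymous mmap, which the kernel hands out on a page
    boundary; its length is rounded up to a multiple of alignment.
    """
    buf = mmap.mmap(-1, round_up(max(size, 1), alignment))
    # The temporary ctypes object is dropped right after addressof(),
    # so buf can still be closed normally later.
    address = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    return buf, address
```

With O_DIRECT all three of the buffer address, the transfer length and the file offset generally have to pass a check like is_aligned, which is why a request for 5000 bytes ends up as an 8192-byte buffer here.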

* Berkeley DB: as can be seen in berkeley-source-tree/src/os/os_open.c on line 72, Berkeley DB uses O_DIRECT only if HAVE_O_DIRECT is defined. Under Linux, however, HAVE_O_DIRECT is not defined, with a comment saying that O_DIRECT is broken under Linux (see berkeley-source-tree/dist/configure.ac for more info). I used Berkeley DB version 5.3.21.

* LevelDB: LevelDB doesn't even try to use O_DIRECT. All new files in LevelDB are opened using the NewWritableFile function, which is implemented differently on each operating system LevelDB supports. Its Linux implementation can be found under level-db-source-tree/util/env_posix.cc.

* MariaDB (a fork of MySQL): MariaDB actually supports O_DIRECT for all its operations. O_DIRECT usage can be toggled by setting the innodb_flush_method option to O_DIRECT. But as explained here (http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html#sysvar_innodb_flush_method), it's off by default, and using O_DIRECT doesn't exclude the use of fsync. Additionally, there is no guarantee that O_DIRECT will make the DB any faster.

* PostgreSQL: PostgreSQL uses O_DIRECT only for its write-ahead log (WAL). Even so, O_DIRECT has caused quite a lot of headaches for the PostgreSQL developers, at least according to this: http://postgresql.1045698.n5.nabble.com/We-really-ought-to-do-something-about-O-DIRECT-and-data-journalled-on-ext4-td3287127.html. Also, in this thread, http://www.postgresql.org/message-id/4C1A6339.9080300@2ndquadrant.com, one of the PostgreSQL developers explains that everything he has read and heard about O_DIRECT is disappointing.

In conclusion, I can say that Linus Torvalds is once again correct: O_DIRECT is completely useless and shouldn't be used.
