RAID5 and journaling file system performance secret!

Q. How do I tune the performance of a journaling file system such as ext4 on a RAID 5 or RAID 6 volume? Why don’t I get the same ext4 performance on a parity-calculating RAID level such as RAID 5 or RAID 6 as I get on a non-parity RAID 0?

Let us first understand the journal options for the ext4 file system. Other file systems such as xfs, reiserfs, and ext3 are also journaling file systems, and the same reasoning applies to them. (btrfs is copy-on-write rather than journaling, so it is a separate case.)

A journaling filesystem is a filesystem that maintains a special file called a journal (usually a circular log) in which it records file system updates. After a system crash, a given set of updates is in one of two states. Either it was fully committed to the filesystem (i.e., written to the HDD), in which case there is no problem, or it is marked as not yet fully committed, in which case the system reads the journal and replays it to bring the filesystem back to the most recent point of data consistency. This is how the journal provides file system consistency.

ext4 provides the following journaling options:
1. data=journal
All data are committed into the journal prior to being written into the main file system. Enabling this mode will disable delayed allocation and O_DIRECT support.

2. data=ordered
All data are forced directly out to the main file system prior to its metadata being committed to the journal.

3. data=writeback
Data ordering is not preserved; data may be written into the main file system after its metadata has been committed to the journal.

data=ordered is the default journal mode for ext4.
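
To see which of these modes a mounted file system is using, you can look at its mount options. The helper below is a minimal sketch, not an official tool: the function name `ext4_data_mode` is made up for this post, and it assumes the usual `/proc/mounts` layout (device, mount point, type, options). Kernels do not always list the mode when it is the file system's default, so in that case the sketch just reports "default".

```shell
#!/bin/sh
# Hypothetical helper: print the data= journaling mode of an ext4 mount point.
# $1 = mount point, $2 = mounts table to read (defaults to /proc/mounts).
ext4_data_mode() {
    awk -v mp="$1" '
        $2 == mp && $3 == "ext4" {
            mode = "default"                      # no data= option listed
            n = split($4, opts, ",")              # options are comma separated
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^data=/) mode = substr(opts[i], 6)
            print mode
        }' "${2:-/proc/mounts}"
}
```

Call it as `ext4_data_mode /mnt/data`, where `/mnt/data` is a placeholder mount point. To choose a mode yourself, pass it at mount time, for example `sudo mount -o data=writeback /dev/sdb /mnt/data`.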

The issue with parity-calculating RAID.
When you format a RAID 5 or RAID 6 volume with ext4 using the default options, the journal is created on the same volume. The parity calculation rules that apply to the file system therefore also apply to its journal. For every newly created file, parity is calculated for the file's data, for its metadata, and for its journal entry. Journal writes are small and frequent, so this extra parity work badly hurts file system performance.

To overcome this, one simple way is to keep the journal of the file system on a separate disk instead of on the parity-calculating volume. This gives a big boost to file system performance.

How to create ext4 file system with journal on separate disk?
I have my RAID5 volume on device /dev/sdb carved out of 6 disks, and I have decided to use partition /dev/sda5 as my journal device.

1. Create the journal device (it must use the same block size as the file system that will use it)
sudo mke2fs -O journal_dev -b 4096 /dev/sda5

2. Create the ext4 filesystem on the RAID5 device and point its journal to /dev/sda5
sudo mkfs.ext4 -b 4096 -J device=/dev/sda5 -E stride=64,stripe-width=320 /dev/sdb
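
The stride and stripe-width values come from the RAID geometry. Here is a rough sketch of the arithmetic. The 256 KiB chunk size is my assumption (it is what stride=64 implies with 4096-byte blocks), and `raid_layout` is just a name invented for this example. Stride is the chunk size in file system blocks, and stripe-width is the stride multiplied by the number of data-bearing disks.

```shell
#!/bin/sh
# Hypothetical helper: raid_layout CHUNK_KIB BLOCK_BYTES TOTAL_DISKS PARITY_DISKS
# Prints the value to pass to mkfs.ext4 -E.
raid_layout() {
    chunk_kib=$1; block_bytes=$2; disks=$3; parity=$4
    stride=$(( chunk_kib * 1024 / block_bytes ))      # chunk size in fs blocks
    stripe_width=$(( stride * (disks - parity) ))     # blocks per full data stripe
    echo "stride=${stride},stripe-width=${stripe_width}"
}

# 6-disk RAID 5 (1 parity disk), assumed 256 KiB chunk, 4096-byte blocks
raid_layout 256 4096 6 1    # → stride=64,stripe-width=320
```

For RAID 6, pass 2 as the number of parity disks. Check the real chunk size of your array before using these numbers.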


Linux Memory Management – How to read top output?

Q. How do I understand memory usage from the output of the top command in Linux? What are buffers and cached memory?

The top command’s output looks like this:

top - 10:25:14 up 5 days, 21:47, 1 user, load average: 5.98, 6.40, 6.52
Tasks: 308 total, 2 running, 306 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.0%us, 10.8%sy, 0.0%ni, 75.1%id, 6.7%wa, 0.0%hi, 0.4%si, 0.0%st
Mem: 65967440k total, 64164816k used, 1802624k free, 15331976k buffers
Swap: 31999996k total, 85556k used, 31914440k free, 15929768k cached

38 root 20 0 0 0 0 S 0 0.0 3:58.90 rcuos/3
45 root 20 0 0 0 0 S 0 0.0 1:39.88 rcuos/10
47 root 20 0 0 0 0 S 0 0.0 3:38.73 rcuos/12
154 root 20 0 0 0 0 S 0 0.0 3:38.41 kworker/4:1
160 root 20 0 0 0 0 S 0 0.0 2:02.53 kworker/10:1

How to read the top output?

1. Mem: 65967440k total
Is the total RAM available on the machine.

2. 64164816k used,
Is the memory in use. This includes memory used by Linux processes as well as the buffers and page cache used for disk IO.

3. 1802624k free
Is the memory that is not used for anything at all.

4. 15331976k buffers
Is the memory used to cache raw disk blocks, which is mostly file system metadata.

5. 15929768k cached
Is the memory used as page cache to store data of recently read/written files.

6. Swap: 31999996k total
Is the total swap space available on the machine.

7. 85556k used
Is the swap space used by the machine.

8. 31914440k free
Is the swap space still available on the machine.
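
The same figures can be read from /proc/meminfo, which makes it easy to work out how much of "used" is really buffers and page cache. The snippet below is only a sketch: `app_used_kib` is a name I made up, and it subtracts just the Buffers and Cached fields, so it ignores other kernel memory.

```shell
#!/bin/sh
# Hypothetical helper: memory in use excluding buffers and page cache, in KiB.
# $1 = meminfo-format file to read (defaults to /proc/meminfo).
app_used_kib() {
    awk '
        $1 == "MemTotal:" { total = $2 }
        $1 == "MemFree:"  { free = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached = $2 }      # exact match, so SwapCached is skipped
        END { printf "%.0f\n", total - free - buffers - cached }
    ' "${1:-/proc/meminfo}"
}
```

With the numbers from the top output above, this gives 64164816 - 15331976 - 15929768 = 32903072 KiB, roughly half of what top reports as used.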

I was really confused to find that (buffers + cached + total memory used by all processes) did not match the used memory on the system. I could not understand this unaccounted memory until I came across:

How to preallocate directory in ext4 file system?

Q. I’m implementing software on top of the ext4 file system. It will have thousands of directories, and each directory will contain thousands of regular files. During the lifetime of the software, the number of files inside the directories keeps growing. As a result, the blocks of each directory get spread across the entire filesystem, which leads to poor directory traversal performance.
To fix this issue, I wanted all the blocks of any given directory to be allocated contiguously on the file system. For this I wanted a mechanism to preallocate directories on the ext4 file system.

The ext4 file system supports preallocation of regular files using the fallocate system call, but there is no support for preallocating blocks for a directory. Why is this so? I guess no one thought it was a required feature.

Anyway, this can be achieved by using an unrelated characteristic of the ext4 filesystem: ext4 never shrinks a directory. Once a block is allocated to a directory, ext4 never reclaims it from that directory. We can exploit this behavior to preallocate the blocks for a directory in the ext4 file system.

For example, I have a directory foo, and I anticipate that over time it will contain 60000 files with file names of up to 48 characters. To preallocate data blocks for the foo directory to hold these future directory entries, take the following steps:
1. Create the foo directory.
2. Inside the foo directory, create 60000 zero-byte files with 48-character file names.
3. Delete all 60000 zero-byte files.
4. At the end of this you have a foo directory with blocks preallocated for its future needs. Because the blocks are requested in one burst, ext4 is likely to place them (mostly) contiguously.

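# Generate 60000 random 48-character file names
LC_ALL=C tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 48 | head -n 60000 > /tmp/files.lst
mkdir foo
cd foo
# xargs splits the list into batches that fit the argument-length limit
xargs touch < /tmp/files.lst
cd ..
# "rm -rf *" can fail with "Argument list too long" for this many names
find foo -mindepth 1 -delete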
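
To check that the blocks really stay with the directory, compare the size of the directory itself before and after the files are removed. This is a small sketch using GNU stat. The name `dir_bytes`, the 2000-file count, and the scratch location are all arbitrary choices for the demo. The scratch directory has to live on ext4 to show the effect, so point TMPDIR at an ext4 path if /tmp is a tmpfs.

```shell
#!/bin/sh
# Hypothetical helper: size in bytes of the directory itself (not its contents).
dir_bytes() {
    stat -c %s "$1"
}

# Demo on a scratch directory with a small, arbitrary number of files.
demo=$(mktemp -d "${TMPDIR:-/tmp}/prealloc-demo.XXXXXX")
echo "empty:        $(dir_bytes "$demo")"
i=0
while [ "$i" -lt 2000 ]; do
    : > "$demo/$(printf 'file-%043d' "$i")"    # 48-character file names
    i=$((i + 1))
done
echo "populated:    $(dir_bytes "$demo")"
find "$demo" -mindepth 1 -delete
echo "after delete: $(dir_bytes "$demo")"
rmdir "$demo"
```

On ext4 the last two numbers should be the same, which shows that the directory kept its blocks. Other file systems may report different values.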