Cross platform coding: data type issues to remember.

1. long data type.
Always remember that on 64-bit Windows (LLP64) long is 4 bytes in size, while on 64-bit Linux (LP64) it is 8 bytes! On 32-bit builds it is 4 bytes on both.
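One way to sidestep the difference is to avoid long wherever the width matters and use the fixed-width types from <stdint.h> instead. The sketch below is only an illustration; the struct and function names are made up for this example.

```c
#include <assert.h>    /* static_assert (C11) */
#include <inttypes.h>  /* PRId64 */
#include <stdint.h>    /* int32_t, int64_t */
#include <stdio.h>

/* Hypothetical record: the field widths do not depend on the platform's long. */
struct record {
    int64_t offset;   /* always 8 bytes */
    int32_t length;   /* always 4 bytes */
};

/* Fail the build, not the run, if a width assumption is wrong. */
static_assert(sizeof(int64_t) == 8, "int64_t must be 8 bytes");
static_assert(sizeof(int32_t) == 4, "int32_t must be 4 bytes");

/* Format a 64-bit value portably; "%ld" would be wrong where long is 4 bytes. */
int format_offset(char *out, size_t out_size, int64_t offset)
{
    return snprintf(out, out_size, "%" PRId64, offset);
}
```

Calling format_offset(buf, sizeof buf, 5000000000) writes the same text on both platforms, whereas a long could not even hold that value on Windows.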

2. Reading unsigned short variable.

void foo(char *shortString)
{
    unsigned short val;
    sscanf(shortString, "%u", &val);   // BUG: "%u" expects an unsigned int *
}

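This code appeared to work on Ubuntu but dumped core on CentOS / RHEL. "%u" makes sscanf store a 4-byte unsigned int into the 2-byte val, which overwrites adjacent memory. That is undefined behavior, so whether it crashes depends on the compiler and the stack layout.
Never read a short variable using "%d" or "%u"; always use "%hd" or "%hu".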

    // this is the right thing to do.
    sscanf(shortString, "%hu", &val);
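A slightly fuller sketch of the corrected read, which also checks the return value of sscanf so that bad input is noticed. The helper name parse_ushort is hypothetical, not part of any library.

```c
#include <assert.h>
#include <stdio.h>

/* Parse an unsigned short from text; returns 1 on success, 0 on failure. */
int parse_ushort(const char *text, unsigned short *out)
{
    assert(text != NULL && out != NULL);
    /* "%hu" tells sscanf that the destination is only an unsigned short. */
    return sscanf(text, "%hu", out) == 1;
}
```

Compiling with -Wall (which enables -Wformat in GCC) also flags the "%u" / unsigned short mismatch at build time.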

GDB: find the thread which has locked the mutex

Q. On Linux, if multi-threaded code appears to hang, how do you find which thread has locked the mutex in question?
1. Attach gdb to the process in question.

$ sudo gdb -p <pid>

2. Get the information of all the running threads.

(gdb) info threads
 20 Thread 0x7f3804dde700 (LWP 19453) "XYZ" 0x00007f38db3afb9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
 19 Thread 0x7f37f82dd700 (LWP 19454) "XYZ" 0x00007f38db3afb9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
 18 Thread 0x7f37f7adc700 (LWP 19455) "XYZ" 0x00007f38db3af6dd in accept () at ../sysdeps/unix/syscall-template.S:81
 17 Thread 0x7f37f70d0700 (LWP 19460) "XYZ" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
 16 Thread 0x7f37f68cf700 (LWP 19461) "XYZ" 0x00007f38dbcecc0b in __memp_get_bucket () from /usr/lib/x86_64-linux-gnu/
 15 Thread 0x7f37f60ce700 (LWP 19463) "XYZ" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
 14 Thread 0x7f37f58cd700 (LWP 19464) "XYZ" 0x00007f38db3af7eb in __libc_recv (fd=27, buf=0x7f37f58ccd88, n=4, flags=-616892437)
 at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
 13 Thread 0x7f37f50cc700 (LWP 19466) "XYZ" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
 12 Thread 0x7f37f48cb700 (LWP 19467) "XYZ" 0x00007f38db3af7eb in __libc_recv (fd=30, buf=0x7f37f48cad88, n=4, flags=-616892437)
 at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33

3. Let's say thread 17 is not able to lock a mutex. Now check the details of thread 17.

(gdb) thread 17
(gdb) where
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f38db3aa664 in _L_lock_952 () from /lib/x86_64-linux-gnu/
#2  0x00007f38db3aa4c6 in __GI___pthread_mutex_lock (mutex=0x7f37e8004da8) at ../nptl/pthread_mutex_lock.c:114
#3  0x000000000041602f in LinuxMutex::lock (this=0x7f37e8004da0) at linux/mutex.cpp:34

Print the details of the locked mutex. In my case the mutex I'm trying to lock is of type LinuxMutex, which is a wrapper class around pthread_mutex_t. So I'm displaying the class object, which has a private member variable m_mutex of type pthread_mutex_t. In your case, you can print the pthread_mutex_t variable directly.

(gdb) p *((LinuxMutex *) 0x7f37e8004da0)
$4 = { = {_vptr.Mutex = 0x433130 }, m_mutex = {__data = {__lock = 2, __count = 1, __owner = 19461, __nusers = 1, __kind = 1,
      __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\001\000\000\000\005L\000\000\001\000\000\000\001", '\000' ,
    __align = 4294967298}}

The pthread_mutex_t variable has a __data.__owner member, which holds the kernel thread ID of the thread that currently holds the mutex. This is the same number that gdb shows as LWP in the info threads output.
In our case the thread which has locked the mutex is LWP 19461.

4. Check the info threads output again to find the thread whose LWP matches the owner.

(gdb) info threads
 20 Thread 0x7f3804dde700 (LWP 19453) "XYZ" 0x00007f38db3afb9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
 19 Thread 0x7f37f82dd700 (LWP 19454) "XYZ" 0x00007f38db3afb9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
 18 Thread 0x7f37f7adc700 (LWP 19455) "XYZ" 0x00007f38db3af6dd in accept () at ../sysdeps/unix/syscall-template.S:81
 17 Thread 0x7f37f70d0700 (LWP 19460) "XYZ" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
 16 Thread 0x7f37f68cf700 (LWP 19461) "XYZ" 0x00007f38dbcecc0b in __memp_get_bucket () from /usr/lib/x86_64-linux-gnu/
 15 Thread 0x7f37f60ce700 (LWP 19463) "XYZ" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
 14 Thread 0x7f37f58cd700 (LWP 19464) "XYZ" 0x00007f38db3af7eb in __libc_recv (fd=27, buf=0x7f37f58ccd88, n=4, flags=-616892437)
 at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
 13 Thread 0x7f37f50cc700 (LWP 19466) "XYZ" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
 12 Thread 0x7f37f48cb700 (LWP 19467) "XYZ" 0x00007f38db3af7eb in __libc_recv (fd=30, buf=0x7f37f48cad88, n=4, flags=-616892437)
 at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33

So here thread 16 (LWP 19461) is the thread which has locked the mutex needed by thread 17.

Linux: get absolute path of running shell script / executable

Q. On Linux, how can a running shell script or a C/C++ executable get its own absolute path?


For shell script:
The line below gives you the absolute path of the directory that contains the running shell script.

ABSOLUTE_DIR_PATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

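For a C/C++ executable:

    #include <unistd.h>   // readlink
    #include <cassert>
    #include <string>

    // Function name is illustrative; path receives the executable's directory.
    void getExecutableDir(std::string &path)
    {
        char linkname[1024];
        ssize_t r;

        // readlink() does not NUL-terminate, so leave room for the terminator.
        r = readlink("/proc/self/exe", linkname, sizeof(linkname) - 1);
        assert(-1 != r);
        linkname[r] = '\0';
        path = linkname;
        std::size_t found = path.find_last_of("/");
        path = path.substr(0, found+1);
    }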



TCP/IP performance tuning

I was getting 70 MBps of network I/O throughput from my TCP/IP based RPC program on a 1 Gb network. When I ran the same program on a 10 Gb network I expected at least a 7-8X performance gain, but to my surprise the gain was merely 10%.

I was using the following settings on my TCP client and server sockets.

// Set send buffer size
setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &sendbuff, sizeof(sendbuff));
// Set receive buffer size
setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &recvbuff, sizeof(recvbuff));
// Set no delay option
int flag = 1;
setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
// Set keepalive socket (flag is still 1; declaring it twice would not compile)
setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, &flag, sizeof(flag));

I started experimenting by turning socket settings on and off. I achieved the 7-8X performance gain when I removed the calls that set the socket send and receive buffer sizes.

When I explored this further I came across the concept of TCP autotuning. More about this can be found here:

Important notes from the above page:

TCP Autotuning automatically adjusts socket buffer sizes as needed to optimally balance TCP performance and memory usage. Autotuning is based on an experimental implementation for NetBSD by Jeff Semke, and further developed by Wu Feng’s DRS and the Web100 Project. Autotuning is now enabled by default in current Linux releases (after 2.6.6 and 2.4.16). It has also been announced for Windows Vista and Longhorn. In the future, we hope to see all TCP implementations support autotuning with appropriate defaults for other options, making this website largely obsolete.

NB: Manually adjusting socket buffer sizes with setsockopt() disables autotuning. Application that are optimized for other operating systems may implicitly defeat Linux autotuning.

Do not use setsockopt() to set send / receive buffer sizes unless you have measured buffer sizes for your application that outperform TCP autotuning. In general it is better to rely on TCP autotuning.

Linux boot loader and boot flag sensitivity

Noting down the observation:

The GRUB boot loader is not sensitive to the boot flag of the boot partition. If the boot flag is not set on the boot partition, or has been wrongly set on a non-boot partition, GRUB can still boot Linux.
The extlinux boot loader, on the other hand, is sensitive to the boot flag. If the boot flag is not set on the boot partition, the machine fails to boot Linux with the error “No boot device found”.

How to list the physical disk on Linux machine?

Q. How to list the physical disks present in a Linux machine, that is, the device file names like /dev/sdX of all the physical disks?

There are many commands to do that.

A. lsblk:
 $ sudo lsblk -d | grep disk
 fd0 2:0 1 4K 0 disk
 sda 8:0 0 128G 0 disk
 sdb 8:16 0 5G 0 disk
 sdc 8:32 0 50G 0 disk

This lists all the physical disks. The output can also include md / LVM devices, which you need to filter out. Change the command as below to get only the physical disk list:

$ sudo lsblk -d | grep disk | grep "^sd" | tr -s ' ' | cut -d ' ' -f 1

B. blkid:

$ sudo blkid
 /dev/sr0: LABEL="VMGUEST" TYPE="iso9660"
 /dev/sda1: UUID="202b983a-02c4-4ee0-9a18-3d3b38f7bc7a" TYPE="swap"
 /dev/sda2: UUID="d7441b33-1aa1-459c-af17-6459b0d18a19" TYPE="ext4"
 /dev/sda3: UUID="611a5df3-16bc-4240-97bd-c8bbdd8ebf2c" TYPE="ext4"
 /dev/sdb1: UUID="6ed027ff-04ff-42a7-afad-08726ada10f9" TYPE="ext4"
 /dev/sdc1: UUID="9a1c60cf-7790-4ed9-9da3-d7c1a9471d78" TYPE="ext4"
 /dev/sdc2: UUID="z32bc1-xzAH-yUvM-xVJp-3hLe-QAj2-Zxp6uN" TYPE="LVM2_member"
 /dev/sdc3: UUID="NWBkAy-Nxtw-OKkq-wdxM-N6db-VzOP-t76PLT" TYPE="LVM2_member"
 /dev/sdc5: UUID="aC1WB9-5p0Y-EsPD-yPBD-m1x0-Q0qW-kuZxXw" TYPE="LVM2_member"
 /dev/sdc6: UUID="bO8SNW-YOWd-6qke-yZIf-5ekT-FbQJ-W2OFnC" TYPE="LVM2_member"

We can modify this command to list the physical disks as below:

 $ sudo blkid | grep "^/dev/sd" | cut -d ':' -f 1 | sed s'/[0-9]//g' | sort | uniq

C. fdisk:

 $ sudo fdisk -l
Disk /dev/sda: 137.4 GB, 137438953472 bytes
255 heads, 63 sectors/track, 16709 cylinders, total 268435456 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0003a379
Device Boot Start End Blocks Id System
/dev/sda1 2048 62500863 31249408 82 Linux swap / Solaris
/dev/sda2 * 62500864 64454655 976896 83 Linux
/dev/sda3 64454656 268433407 101989376 83 Linux
Disk /dev/sdb: 5368 MB, 5368709120 bytes
181 heads, 40 sectors/track, 1448 cylinders, total 10485760 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000497c4
Device Boot Start End Blocks Id System
/dev/sdb1 * 2048 10485759 5241856 83 Linux
Disk /dev/sdc: 53.7 GB, 53687091200 bytes
255 heads, 63 sectors/track, 6527 cylinders, total 104857600 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000dc579
Device Boot Start End Blocks Id System
/dev/sdc1 * 2048 12290047 6144000 83 Linux
/dev/sdc2 12290048 32770047 10240000 8e Linux LVM
/dev/sdc3 32770048 53250047 10240000 8e Linux LVM
/dev/sdc4 53250048 104857599 25803776 5 Extended
/dev/sdc5 53254144 73734143 10240000 8e Linux LVM
/dev/sdc6 73736192 94216191 10240000 8e Linux LVM
/dev/sdc7 94218240 104458239 5120000 83 Linux

You can change this command to get the physical disks as below:

 $ sudo fdisk -l | grep "^Disk /dev/sd" | cut -d ' ' -f 2 | sed 's/://'

Linux: How to reclaim page cache?

What is page cache?
To optimize disk performance Linux caches data of recently read / written files. This part of RAM is called page cache.

Why would you need to reclaim page cache?
By default Linux reclaims page cache using an LRU algorithm. While LRU suffices for general purposes, it may not suit the specific requirements of I/O-intensive software. In that case the software should implement its own logic to reclaim page cache.

How to advise the kernel to reclaim the page cache specific to a file?
posix_fadvise is the system call used to advise the kernel about the page cache behavior for a particular file.

If you want to read a file without it continuing to occupy page cache, do it the following way:

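#include <unistd.h>
#include <fcntl.h>
int main(int argc, char *argv[]) {
    int fd;
    int ret;
    if (argc < 2)
        return 1;
    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;
    // offset 0, len 0: whole file. Call this after reading to drop the pages just read.
    ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return ret;   // 0 on success, an error number otherwise
}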

If you want to reclaim the page cache of a file that is being written, it needs to be handled differently. Linux has to keep the written data in the page cache until it has flushed the contents to disk; until then it cannot reclaim those pages. So before calling posix_fadvise, you have to call fdatasync, or wait long enough (more than 30 seconds).
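The write case can be sketched roughly as follows: write, force the data to disk with fdatasync, and only then give the advice. The function name write_and_drop_cache is hypothetical, and a short write is simply treated as an error to keep the example small.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write len bytes to path, flush them to disk, then ask the kernel to drop
 * the file's pages from the page cache.
 * Returns 0 on success, -1 on failure. */
int write_and_drop_cache(const char *path, const void *data, size_t len)
{
    int fd, ret;

    assert(path != NULL && data != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Dirty pages cannot be reclaimed, so they must reach the disk first. */
    if (write(fd, data, len) != (ssize_t)len || fdatasync(fd) != 0) {
        close(fd);
        return -1;
    }

    /* offset 0, len 0: the advice covers the whole file. */
    ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return ret == 0 ? 0 : -1;
}
```

posix_fadvise is only advice, so the kernel is free to keep some pages; the call succeeding does not guarantee that every page was dropped.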

Why 30 seconds?
Because after at most 30 seconds Linux treats dirty data in the page cache as stale and starts flushing it to disk using the pdflush/bdflush thread. The related setting can be found in file:
(default 3000): In hundredths of a second, how long data can be in the page cache before it’s considered expired and must be written at the next opportunity. Note that this default is very long: a full 30 seconds. That means that under normal circumstances, unless you write enough to trigger the other pdflush method, Linux won’t actually commit anything you write until 30 seconds later.

More information can be found at:

RAID5 and journaling file system performance secret!

Q. How to tune the performance of a journaling file system like ext4 on a RAID 5 or RAID 6 volume? Why don't I get the same ext4 performance on a parity-calculating RAID level such as RAID 5 or RAID 6 as on non-parity RAID 0?

Let us first understand the journal options of the ext4 file system. Other file systems such as xfs, reiserfs and ext3 are also journaling file systems, and the answer applies to them as well.

A journaling filesystem is a filesystem that maintains a special file called a journal (which is usually a circular log) that is used to record the file system updates. In the event of a system crash, a given set of updates may have either been fully committed to the filesystem (i.e., written to the HDD), in which case there is no problem, or the updates will have been marked as not yet fully committed, in which case the system will read the journal, which can be rolled up to the most recent point of data consistency. Thus journal can provide file system consistency.

ext4 provides following journaling options:
1. data=journal
All data are committed into the journal prior to being written into the main file system. Enabling this mode will disable delayed allocation and O_DIRECT support.

2. data=ordered
All data are forced directly out to the main file system prior to its metadata being committed to the journal.

3. data=writeback
Data ordering is not preserved, data may be written into the main file system after its metadata has been committed to the journal.

data=ordered is the default journal mode for ext4.

Issue with parity calculating RAID.
When you format a RAID 5 or RAID 6 volume with the ext4 file system using default options, the journal is created on the same volume. The parity calculation rules for the file system then also apply to its journal. For every newly created file, parity is calculated for the data part of the file, for the metadata part of the file, and for the journal entry of the file. This badly affects the performance of the file system.

To overcome this, one simple way is to keep the journal of the file system on a separate disk and not on the same parity-calculating volume. This gives a great boost to file system performance.

How to create ext4 file system with journal on separate disk?
I have my RAID5 volume on device /dev/sdb carved out of 6 disks, and I have decided to use partition /dev/sda5 as my journal device.

1. Create journal device (its block size must match the file system created in step 2)
sudo mke2fs -O journal_dev -b 4096 /dev/sda5

2. Create ext4 filesystem on RAID5 device and point the journal device to /dev/sda5
sudo mkfs.ext4 -b 4096 -J device=/dev/sda5 -E stride=64,stripe-width=320 /dev/sdb
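The stride and stripe-width values passed to mkfs.ext4 follow from the RAID geometry: stride is the RAID chunk size expressed in file system blocks, and stripe-width is stride multiplied by the number of data-bearing disks. The sketch below reproduces the numbers used here. The 256 KiB chunk size is an assumption inferred from stride=64 with 4096-byte blocks, and the names are illustrative only.

```c
#include <assert.h>

struct ext4_raid_layout {
    unsigned stride;        /* file system blocks per RAID chunk */
    unsigned stripe_width;  /* file system blocks per full data stripe */
};

/* chunk_kib:    RAID chunk size in KiB (assumed to be 256 for this volume)
 * block_bytes:  file system block size in bytes
 * total_disks:  number of disks in the array
 * parity_disks: 1 for RAID 5, 2 for RAID 6 */
struct ext4_raid_layout compute_ext4_raid_layout(unsigned chunk_kib,
                                                 unsigned block_bytes,
                                                 unsigned total_disks,
                                                 unsigned parity_disks)
{
    struct ext4_raid_layout layout;

    assert(block_bytes > 0 && total_disks > parity_disks);

    layout.stride = chunk_kib * 1024u / block_bytes;
    layout.stripe_width = layout.stride * (total_disks - parity_disks);
    return layout;
}
```

For the 6-disk RAID 5 volume above, compute_ext4_raid_layout(256, 4096, 6, 1) gives stride 64 and stripe-width 320.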

Linux Memory Management – How to read top output?

Q. How to understand the memory usage from the output of top command in linux? What is buffers, cached memory?

The output of the top command looks like this:

top - 10:25:14 up 5 days, 21:47, 1 user, load average: 5.98, 6.40, 6.52
Tasks: 308 total, 2 running, 306 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.0%us, 10.8%sy, 0.0%ni, 75.1%id, 6.7%wa, 0.0%hi, 0.4%si, 0.0%st
Mem: 65967440k total, 64164816k used, 1802624k free, 15331976k buffers
Swap: 31999996k total, 85556k used, 31914440k free, 15929768k cached

38 root 20 0 0 0 0 S 0 0.0 3:58.90 rcuos/3
45 root 20 0 0 0 0 S 0 0.0 1:39.88 rcuos/10
47 root 20 0 0 0 0 S 0 0.0 3:38.73 rcuos/12
154 root 20 0 0 0 0 S 0 0.0 3:38.41 kworker/4:1
160 root 20 0 0 0 0 S 0 0.0 2:02.53 kworker/10:1

How to read the top output?

1. Mem: 65967440k total
Is the total RAM available on the machine.

2. 64164816k used,
Is the memory in use, both by Linux processes and by the kernel for disk I/O (buffers and page cache).

3. 1802624k free
Is the free memory available on the machine.

4. 15331976k buffers
Is the memory used by the kernel to buffer raw disk blocks, which includes file system metadata.

5. 15929768k cached
Is the memory used as page cache to store data of recently read/written files. top prints it on the Swap line, but it is RAM, not swap.

6. Swap: 31999996k total
Is the total swap space available on the machine.

7. 85556k used
Is the swap space used by the machine.

8. 31914440k free
Is the swap space still free on the machine.
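Since buffers and page cache can largely be reclaimed by the kernel when applications need memory, a rough reading of these numbers is: free + buffers + cached is approximately what applications could still get, and used - buffers - cached is what is held by everything else. The sketch below does this arithmetic; the function names are my own, and the result is only an estimate because not every cached page is reclaimable.

```c
#include <assert.h>

/* All values are in KiB, as printed by top. */

/* Rough upper bound on memory that applications could still obtain. */
unsigned long long approx_available_kib(unsigned long long free_kib,
                                        unsigned long long buffers_kib,
                                        unsigned long long cached_kib)
{
    return free_kib + buffers_kib + cached_kib;
}

/* Memory in use that is neither buffers nor page cache. */
unsigned long long used_excluding_cache_kib(unsigned long long used_kib,
                                            unsigned long long buffers_kib,
                                            unsigned long long cached_kib)
{
    assert(used_kib >= buffers_kib + cached_kib);
    return used_kib - buffers_kib - cached_kib;
}
```

With the values from the output above, approx_available_kib(1802624, 15331976, 15929768) is 33064368 and used_excluding_cache_kib(64164816, 15331976, 15929768) is 32903072; together they add up to the 65967440k total.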

I was really confused to find that (buffers + cached + total memory used by all processes) did not match the used memory on the system. I could not understand this unaccounted memory until I came across:

How to preallocate directory in ext4 file system?

Q. I'm implementing software on top of the ext4 file system which will have thousands of directories, each containing thousands of regular files. During the lifetime of the software the number of files inside the directories keeps growing. As a result the blocks of each directory get spread across the entire file system, which leads to poor directory traversal performance.
To fix this I wanted all the blocks of any given directory to be allocated contiguously on the file system. For this I needed a mechanism to preallocate directories on the ext4 file system.

The ext4 file system supports preallocation of regular files using the fallocate system call, but there is no support for preallocating blocks for a directory. Why is this so? I guess no one thought it was a required feature.

Anyway, this can be implemented using an unrelated characteristic of the ext4 file system: ext4 never shrinks a directory. Once a block is allocated to a directory, ext4 never reclaims it from that directory. We can exploit this to preallocate the blocks for a directory.

For example, I have a directory foo, and I anticipate that over time it will contain 60000 files with file names of up to 48 characters. To preallocate data blocks for the foo directory to hold these future directory entries, take the following steps:
1. Create the foo directory.
2. Inside the foo directory, create 60000 zero-byte files with 48-character file names.
3. Delete all these 60000 zero-byte files.
4. At the end of this you have a foo directory with blocks preallocated for its future needs. Delayed allocation in ext4 makes sure that these blocks are (mostly) contiguous.

cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 48 | head -n 60000 > /tmp/files.lst
mkdir foo
cd foo
cat /tmp/files.lst | xargs touch
rm -rf *
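If the software itself has to prepare its directories, the same create-then-delete trick can be done from C. This is a minimal sketch under a few assumptions: the function name preallocate_directory is hypothetical, and it uses zero-padded sequential names of the requested length instead of the random names generated by the shell commands above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create dir if needed, fill it with `count` empty files whose names are
 * exactly `name_len` characters long, then delete them again.
 * Returns 0 on success, -1 on the first failure. */
int preallocate_directory(const char *dir, unsigned count, int name_len)
{
    char path[4096];
    unsigned i;
    int pass, fd, n;

    /* 10 digits are enough for any unsigned 32-bit counter; 255 is NAME_MAX. */
    assert(dir != NULL && name_len >= 10 && name_len <= 255);

    if (mkdir(dir, 0755) != 0 && errno != EEXIST)
        return -1;

    /* pass 0 creates the files, pass 1 removes them */
    for (pass = 0; pass < 2; pass++) {
        for (i = 0; i < count; i++) {
            n = snprintf(path, sizeof(path), "%s/%0*u", dir, name_len, i);
            if (n < 0 || (size_t)n >= sizeof(path))
                return -1;
            if (pass == 0) {
                fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
                if (fd < 0)
                    return -1;
                close(fd);
            } else if (unlink(path) != 0) {
                return -1;
            }
        }
    }
    return 0;
}
```

For the example above the call would be preallocate_directory("foo", 60000, 48). Unlike rm -rf *, it only removes the files it created itself.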