How to make sure a Linux process doesn’t get swapped out?

Q. On Linux I’m working on a program that uses a lot of RAM (~8 GB) for a database. The program also generates heavy file I/O activity, which fills a large amount of page cache (~25 GB). The program runs for days, and I observed that the db was not performing as expected. I realized this was because RAM pages of the process were getting swapped out. How can you prevent the RAM pages of a process from being swapped out on Linux?
ANS:
Swappiness:

Swappiness is a Linux kernel parameter that controls the relative weight given to swapping out runtime memory, as opposed to dropping pages from the system page cache. Swappiness can be set to values between 0 and 100 inclusive (kernels from 5.8 onwards accept values up to 200). A low value causes the kernel to avoid swapping; a higher value causes the kernel to use swap space more readily. The default value is 60.

You can check your swappiness value by:
user@machine:$ cat /proc/sys/vm/swappiness
60

You can change swappiness value by:
# Set the swappiness value as root
echo 10 > /proc/sys/vm/swappiness

One way to reduce the chance that RAM pages of a running process get swapped out is to set the swappiness value low, to 10 or less. A low value makes swapping less likely but does not rule it out completely.

This setting has a global effect: it applies to all the running processes on the system. That may not be what you want if you wish to protect only one particular process. The second approach solves this.

mlockall:
mlockall pins all the RAM pages of the calling process. A process can prevent its existing and future RAM pages from being swapped out by calling:
mlockall(MCL_CURRENT | MCL_FUTURE);
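
The call can fail, so it is worth checking its return value. Below is a minimal sketch of such a check; the function name lockProcessMemory is only an illustration and not part of any API. On Linux an unprivileged process can lock memory only up to its RLIMIT_MEMLOCK limit, so a process with ~8 GB of data will most likely need a raised limit or the CAP_IPC_LOCK capability.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Pin all current and future pages of the calling process in RAM.
 * Returns 0 on success, or -1 with errno set by mlockall().
 * (lockProcessMemory is a hypothetical helper name.)
 */
int lockProcessMemory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1) {
        /* Typical causes: EPERM (no privilege), ENOMEM (over RLIMIT_MEMLOCK) */
        fprintf(stderr, "mlockall failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}
```

Call lockProcessMemory() once, early in main(), before the database allocates its memory. The lock can be released again with munlockall().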

IP address validation

Q. How to validate an IP address (IPv4 or IPv6) given as a string input?

Ans: Using a regex is not a good idea. Instead, use the function inet_pton to do the validation. A sketch of the IP address validation is shown below:

#include <arpa/inet.h>   /* inet_pton, struct in6_addr */
#include <stdbool.h>

bool
isValidIPAddress(
    const char *ipAddr
    )
{
    unsigned char buf[sizeof(struct in6_addr)];
    int ret;

    ret = inet_pton(AF_INET, ipAddr, buf);
    if (1 == ret) {
        return true;
    }

    ret = inet_pton(AF_INET6, ipAddr, buf);
    if (1 == ret) {
        return true;
    }
    return false;
}

Use of FSCTL_GET_VOLUME_BITMAP to read $bitmap file from NTFS / ReFS volume

Q. How to read $bitmap file from NTFS / ReFS volume?

Ans:

The FSCTL_GET_VOLUME_BITMAP IOCTL is provided to read the $bitmap file. To read it in a single call you need to know the size of $bitmap in advance. If you do not know it, you have to read the $bitmap file in chunks and store them in a file to refer to later. Sample code for reading the $bitmap file in chunks is shown below.

Also look up the structures STARTING_LCN_INPUT_BUFFER and VOLUME_BITMAP_BUFFER.

 

#define BITMAP_CHUNK_SIZE (32*1024)	// 32 KB is our chunk size to read $bitmap file

int
readVolumeBitmap(
    std::wstring volumeName
    )
{
	HANDLE hVolume;
    STARTING_LCN_INPUT_BUFFER startingLcn;
	VOLUME_BITMAP_BUFFER *volBitmap;
    UINT32 bitmapSize;
    DWORD bytesReturned, bytesWritten;
    BOOL ret, retFile;
    HANDLE hFile;
    std::wstring bitmapFile;
    int result;
	
    /* Open the volume for reading */
    hVolume = CreateFile(
                           volumeName.c_str(),
                           GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL,
                           OPEN_EXISTING,
                           0,
                           NULL);
    if (INVALID_HANDLE_VALUE == hVolume) {
        /* could not open the volume */
        return -1;
    }

    startingLcn.StartingLcn.QuadPart = 0;
    bitmapSize = BITMAP_CHUNK_SIZE + sizeof(LARGE_INTEGER)*2;
    volBitmap = (VOLUME_BITMAP_BUFFER *) malloc (bitmapSize);

	bitmapFile = L"dummy_bitmap.bin";

    hFile = CreateFile(
                       bitmapFile.c_str(),
                       GENERIC_WRITE,
                       FILE_SHARE_READ | FILE_SHARE_WRITE,
                       NULL,
                       CREATE_ALWAYS,
                       0,
                       NULL);


	/* Read $bitmap file in loop */
    while (TRUE) {
        ret = DeviceIoControl (
                               hVolume,
                               FSCTL_GET_VOLUME_BITMAP,
                               &startingLcn,
                               sizeof(STARTING_LCN_INPUT_BUFFER),
                               volBitmap,
                               bitmapSize,
                               &bytesReturned,
                               NULL);

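        if (FALSE==ret){
            /* only error we expect is to read more data */
            if (GetLastError() != ERROR_MORE_DATA) {
                /* some ioctl error */
                CloseHandle(hFile);
                CloseHandle(hVolume);
                free (volBitmap);
                return -1;
            }
        }
        /* the returned bitmap must start at the LCN we asked for */
        if (startingLcn.StartingLcn.QuadPart != volBitmap->StartingLcn.QuadPart) {
            CloseHandle(hFile);
            CloseHandle(hVolume);
            free (volBitmap);
            return -1;
        }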

		/* find out the exact bytes read from $bitmap */
        bytesReturned -= sizeof(LARGE_INTEGER)*2;
        retFile = WriteFile (
                             hFile,
                             volBitmap->Buffer,
                             bytesReturned,
                             &bytesWritten,
                             NULL);
        /* If EOF was reached while reading $bitmap, break the loop */
        if (TRUE==ret) {
            break;
        }
        /* Update the read offset for next request */
        startingLcn.StartingLcn.QuadPart += bytesReturned*8;
    }

    CloseHandle(hFile);
	CloseHandle(hVolume);
    free (volBitmap);

    return 0;
}
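
Once the bitmap has been saved, each bit describes one cluster: bit 0 of byte 0 is the first cluster, bit 1 of byte 0 the next, and so on, with 1 meaning allocated and 0 meaning free. This is also why the loop above advances the starting LCN by bytesReturned*8. A minimal sketch of interpreting the saved bytes follows; the helper names isClusterAllocated and countAllocatedClusters are made up for this example, and the bitmap is assumed to start at LCN 0.

```c
#include <stdint.h>

/*
 * Returns 1 if the cluster at logical cluster number lcn is marked
 * allocated in the bitmap, 0 if it is free. The bitmap is assumed to
 * start at LCN 0 and to cover at least lcn + 1 bits.
 */
int isClusterAllocated(const unsigned char *bitmap, uint64_t lcn)
{
    return (bitmap[lcn / 8] >> (lcn % 8)) & 1;
}

/* Counts the allocated clusters among the first clusterCount bits. */
uint64_t countAllocatedClusters(const unsigned char *bitmap,
                                uint64_t clusterCount)
{
    uint64_t lcn;
    uint64_t used = 0;

    for (lcn = 0; lcn < clusterCount; lcn++) {
        used += (uint64_t) isClusterAllocated(bitmap, lcn);
    }
    return used;
}
```

For a backup product, this is typically the information that decides which clusters need to be copied.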

How to close a socket after a short inactivity timeout?

Q. How can a TCP connection be closed and re-initiated if there is no activity for a timeout period, when the timeout is short, of the order of a few seconds or minutes?

Ans:
It is not easy to identify a dead or non-responsive peer on a TCP connection. Normally, when the timeout period is long, of the order of a few hours, you can solve the problem using TCP keepalive. But it makes little sense to use TCP keepalive when the timeout period is short, of the order of a few seconds. Such short timeouts can be handled with the select system call, which checks whether there is any activity on the socket within the timeout period.

Let’s say you’re listening on a socket and you expect to read a certain packet of known size at least once every 40 seconds; if you don’t receive the packet, you want to close the TCP connection. You can solve this with the following code sketch.

#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

int
readSocket(
    int fdSocket,
    char *buffer,
    int length
    )
{
    int totalRead;
    int thisTimeRead;
    fd_set readSet;
    int result;
    struct timeval timeout;

    for (totalRead = 0; totalRead < length; ) {
        FD_ZERO(&readSet);
        FD_SET(fdSocket, &readSet);
        timeout.tv_sec= 40;
        timeout.tv_usec = 0;
        /* wait till there is data to be read on the socket or timeout happens */
        result = select(fdSocket+1, &readSet, NULL, NULL, &timeout);
        if (-1 == result) {
            /* Some error in select api */
            if (EINTR == errno) {
                continue;
            }
            /* close socket and return error */
            close(fdSocket);
            return -1;
        }
        /* check if timeout happened */
        if (0 == result) {
            /* timeout, close the socket return error */
            close(fdSocket);
            return -1;
        }

        thisTimeRead = read(fdSocket, buffer+totalRead, length-totalRead);
        if (thisTimeRead < 0) {
            if (EINTR == errno) {
                continue;
            }
            /* read error, close the socket like the other error paths */
            close(fdSocket);
            return -1;
        } else if (0 == thisTimeRead) {
            /* EOF detected */
            break;
        }
        /* data was read from the socket, so the connection is still active */
        totalRead += thisTimeRead;
    }

    return totalRead;
}

The preferred way to achieve the same effect for both socket read and write operations is to set the socket options SO_RCVTIMEO and SO_SNDTIMEO, which the socket(7) man page describes as follows:

http://linux.die.net/man/7/socket

SO_RCVTIMEO and SO_SNDTIMEO
Specify the receiving or sending timeouts until reporting an error. The argument is a struct timeval. If an input or output function blocks for this period of time, and data has been sent or received, the return value of that function will be the amount of data transferred; if no data has been transferred and the timeout has been reached then -1 is returned with errno set to EAGAIN or EWOULDBLOCK, or EINPROGRESS (for connect(2)) just as if the socket was specified to be nonblocking. If the timeout is set to zero (the default) then the operation will never timeout. Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), and so on.
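
As a rough sketch of how the receive timeout could be used: the helper names setReceiveTimeout and recvOrTimeout, and the -2 return value for a timeout, are choices made for this example only.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/*
 * Set the receive timeout of a socket to the given number of seconds.
 * Returns 0 on success, -1 on failure (errno set by setsockopt()).
 */
int setReceiveTimeout(int fdSocket, long seconds)
{
    struct timeval timeout;

    timeout.tv_sec = seconds;
    timeout.tv_usec = 0;
    return setsockopt(fdSocket, SOL_SOCKET, SO_RCVTIMEO,
                      &timeout, sizeof(timeout));
}

/*
 * Receive up to length bytes. Returns the byte count, 0 on EOF,
 * -1 on error, or -2 if nothing arrived within the timeout.
 */
ssize_t recvOrTimeout(int fdSocket, char *buffer, size_t length)
{
    ssize_t n;

    /* retry if interrupted by a signal (the timeout starts over) */
    do {
        n = recv(fdSocket, buffer, length, 0);
    } while (n == -1 && errno == EINTR);

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return -2;  /* timeout: the caller would close and reconnect */
    }
    return n;
}
```

With setReceiveTimeout(fdSocket, 40) called once after connecting, the read loop no longer needs select; a return value of -2 from recvOrTimeout means the 40 seconds passed without data.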

Berkeley DB performance and disk configuration

In this post I want to highlight the impact of the underlying disk configuration on the performance of Berkeley DB (BDB).
The tests were performed as follows:
1. The BDB access method is B-tree (DB_BTREE). Records have only a key and no data. The key length is 24 bytes.
2. A random key is generated and looked up in the db; if it does not exist, it is added to the db.
3. The same set of keys, in the same sequence, was generated for each of the tests below.
4. First, performance was measured with the db stored on a single hard disk.
5. Then the same performance test was run with the db stored on a single SSD.
6. Last, the performance test was run with the db stored on a striped volume consisting of 3 disks.

Performance graph:

[Figure: Disk configuration impact on BDB performance]

The x-axis of the graph shows size in TB; 1 unit on the x-axis corresponds to ~8 million unique keys in BDB.

These tests were done to decide on a storage appliance configuration. To keep the cost of the appliance low, we decided to use a striped volume instead of an SSD. Even better performance should be possible if the DB is stored on a striped volume created on top of SSDs.

How to parse a TAR file?

Q. How to list the contents of a tar file as file names and their sizes?

Ans:

Details of the tar file format can be found here: http://en.wikipedia.org/wiki/Tar_%28computing%29

The original tar file format had two problems:
1. archiving a file larger than 8 GB
2. archiving a file whose name is longer than 100 bytes.

Below is the tar header format:

{
    char name[100];
    char mode[8];
    char uid[8];
    char gid[8];
    unsigned char size[12];
    char mtime[12];
    char chksum[8];
    char typeflag;
    char linkname[100];
    char magic[6];
    char version[2];
    char uname[32];
    char gname[32];
    char devmajor[8];
    char devminor[8];
    char prefix[155];
    char pad[12];
}

These problems were later fixed in the GNU tar format.

a. To parse the file size, use the rules below:
Numeric values are encoded in octal numbers using ASCII digits, with leading zeroes. For historical reasons, a final NUL or space character should be used. Thus although there are 12 bytes reserved for storing the file size, only 11 octal digits can be stored. This gives a maximum file size of 8 gigabytes on archived files. To overcome this limitation, star in 2001 introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field. GNU-tar and BSD-tar followed this idea.

b. To parse the file name, use the rules below:
If the first character of prefix is \0 (null character), the file name is name; otherwise, it is prefix/name. Files whose pathnames don’t fit in that length cannot be stored in a tar archive.

c. LongLink rule:
I found that although rule ‘b’ is documented, GNU tar does not follow it. Long file names are stored in the tar using the LongLink concept: a special tar entry used only to store a long file name. The LongLink header has typeflag ‘L’, and the file name in this header is “././@LongLink”. The data content of this entry is simply the “long” file name for the next archive entry.

To generate a “gnu” format tar use below command in shell:

$ tar --format=gnu -cvf xyz.tar dir

The following code parses a “gnu” tar file and prints the file name and file size of all the archived files.

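#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <string>

struct GnuTarHeader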
{
    char name[100];
    char mode[8];
    char uid[8];
    char gid[8];
    unsigned char size[12];
    char mtime[12];
    char chksum[8];
    char typeflag;
    char linkname[100];
    char magic[6];
    char version[2];
    char uname[32];
    char gname[32];
    char devmajor[8];
    char devminor[8];
    char prefix[155];
    char pad[12];
};
 
void validateTarHeader(GnuTarHeader *tarHeader);
void parseFileSize(GnuTarHeader *tarHeader);
void parseFileName(GnuTarHeader *tarHeader);
void parseLongLink(GnuTarHeader *tarHeader, int fd);
void parseTarHeader(GnuTarHeader *tarHeader);
 
std::string currentFileName;
unsigned long long currentFileSize;
bool lastLongLinkHeader;
 
char TAR_MAGIC[] = "ustar ";
 
int main(int argc, char **argv)
{
    int fd;
    int ret;
    unsigned long long seek;
    GnuTarHeader gnuHeader, emptyHeader;
    int emptyHeaders = 0;
 
    if (argc != 2) {
        printf ("Usage: %s tar_file_name\n", argv[0]);
        return 1;
    }
 
    fd = open(argv[1], O_RDONLY);
    assert (fd != -1);
 
    memset (&emptyHeader, 0, 512);
 
    while (1) {
        ret = read(fd, &gnuHeader, 512);
        assert(ret == 512);
        if (0 == memcmp(&gnuHeader, &emptyHeader, 512)) {
            emptyHeaders ++;
            if (2 == emptyHeaders) {
                break;
            }
            continue;
        }
        emptyHeaders = 0;
        validateTarHeader(&gnuHeader);
        if ('L' == gnuHeader.typeflag) {
            parseLongLink(&gnuHeader, fd);
        } else {
            parseTarHeader(&gnuHeader);
            seek = (currentFileSize/512) + (currentFileSize%512 ? 1 : 0);
            seek *= 512;
            seek = lseek(fd, seek, SEEK_CUR);
            assert(seek != -1);
        }
    }
 
    return 0;
}
 
void validateTarHeader(GnuTarHeader *tarHeader)
{
    for (int i=0; i<6; i++) {
        assert(tarHeader->magic[i] == TAR_MAGIC[i]);
    }
}
 
void parseFileSize(GnuTarHeader *tarHeader)
{
    int i;
 
    // parse the file size.
    currentFileSize = 0;
 
    if (tarHeader->size[0] & (0X01 << 7)) {
        // file size > 8 GB.
        for (i=1; i<12; i++) {
            currentFileSize *= 256;
            currentFileSize += tarHeader->size[i];
        }
    } else {
        // file size < 8 GB.
        for (i=0; i<12; i++) {
            if ((0 == tarHeader->size[i]) || (' ' == tarHeader->size[i])) {
                continue;
            }
            currentFileSize *= 8;
            currentFileSize += (tarHeader->size[i] - '0');
        }
    }
}
 
void parseFileName(GnuTarHeader *tarHeader)
{
    int i;
    char fileName[256];
 
    currentFileName = "";
 
    if (0 != tarHeader->prefix[0]) {
        for (i=0; i<155; i++) {
            if (0 == tarHeader->prefix[i]) {
                break;
            }
            fileName[i] = tarHeader->prefix[i];
        }
        fileName[i] = '\0';
        currentFileName = fileName;
        currentFileName += "/";
    }
 
    for (i=0; i<100; i++) {
        // stop at the terminating NUL of the name field
        if (0 == tarHeader->name[i]) {
            break;
        }
        fileName[i] = tarHeader->name[i];
    }
 
    fileName[i] = '\0';
    currentFileName += fileName;
}
 
void parseLongLink(GnuTarHeader *tarHeader, int fd)
{
    int ret;
    char fileName[512+1]; // last byte for '\0'
 
    currentFileName = "";
    parseFileSize(tarHeader);
    while (true) {
        // the long name is stored in 512-byte blocks after the header
        ret = read (fd, fileName, 512);
        assert(ret == 512);
        if (currentFileSize > 512) {
            fileName[512] = '\0';
        } else {
            fileName[currentFileSize] = '\0';
            currentFileName += fileName;
            break;
        }
        currentFileSize -= 512;
        currentFileName += fileName;
    }
 
    lastLongLinkHeader = true;
}
 
void parseTarHeader(GnuTarHeader *tarHeader)
{
    parseFileSize(tarHeader);
 
    // parse the filename.
    if (false == lastLongLinkHeader) {
        parseFileName(tarHeader);
    }
 
    lastLongLinkHeader = false;
    printf ("%s %llu\n", currentFileName.c_str(), currentFileSize);
}

How to handle signal SIGPIPE?

Q. A C program on Unix crashed after it attempted to write to a closed socket. How to handle this?

Ans:
In a Unix environment, when a process attempts to write to a socket that is shut down for writing or that is no longer connected, the OS sends the signal SIGPIPE to that process.
The default behavior of SIGPIPE is to terminate the process. To avoid this, your process must ignore SIGPIPE. In that case, if your process attempts to write to a closed socket, the send() / write() call returns -1 and errno is set to EPIPE. Your process should handle this write failure accordingly. This is how you stop your program from crashing abruptly.

The C code below shows how to ignore the signal SIGPIPE.

{
    int ret;
    struct sigaction sa;

    /*
     * There are chances that our application sends 
     * data to a closed socket. This generates SIGPIPE
     * signal, and results into process termination.
     * We must ignore this signal, as our application is 
     * equipped to handle errors happening on socket writes.
     */

    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);   /* sa_mask must not be left uninitialized */
    sa.sa_flags = 0;
    ret = sigaction(SIGPIPE, &sa, NULL);
    assert(-1 != ret);
}
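
To see the effect, here is a small self-contained sketch. It ignores SIGPIPE and then writes to a socket whose peer is already closed, so that the failure shows up as errno EPIPE instead of killing the process. The function names ignoreSigpipe and writeToClosedPeer are made up for this example, and a local socketpair stands in for a real TCP connection.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Ignore SIGPIPE for the whole process. Returns 0 on success, -1 on error. */
int ignoreSigpipe(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPIPE, &sa, NULL);
}

/*
 * Write to a socket whose peer has already been closed.
 * Returns the errno reported by write(), 0 if the write unexpectedly
 * succeeded, or -1 if the socket pair could not be created.
 */
int writeToClosedPeer(void)
{
    int sv[2];
    int err = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
        return -1;
    }
    close(sv[1]);                    /* the peer goes away */
    if (write(sv[0], "x", 1) == -1) {
        err = errno;                 /* EPIPE once SIGPIPE is ignored */
    }
    close(sv[0]);
    return err;
}
```

If ignoreSigpipe() is not called first, the same write terminates the process with SIGPIPE, which is the crash described in the question.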

Windows P2V machine’s boot error status: 0xC000000E

Q. I’m working on developing a P2V solution for Windows machines. For one client I got the boot error 0xC000000E (Info: “The boot selection failed because a required device is inaccessible”). How to fix the boot error 0xC000000E for a virtual machine?

Ans:
If this problem happens on a physical machine, check the link below, which also explains the error nicely:
http://www.prime-expert.com/articles/b16/fix-0xC000000E-required-device-is-inaccessible.php

The error looks like this:

[Screenshot: boot error status 0xC000000E]

Process to fix the corrupt BCD:

The tool used for fixing this is: BCDboot.
1. Mount the virtual hard disk (VHD) files that contain the Boot and System partitions on a Windows machine. (The Boot and System partitions can be on two different disks.) To know more about these partition types check: http://support.microsoft.com/kb/314470
2. Let’s assume your Boot partition is mounted at B:\ and your System partition is mounted at S:\
3. Then use the commands below to fix the corrupt BCD.

Save original BCD data

 cd /d S:\Boot
 ren BCD BCD.old

Run BCDboot to fix corrupt BCD:

This will fix both UEFI & BIOS based booting.

 BCDboot.exe B:\Windows /s S: /f ALL

If your hypervisor doesn’t support UEFI booting, use the command below:

 BCDboot.exe B:\Windows /s S:

This will fix your 0xC000000E boot problem.

Windows Random File Generator

Q. Generate random files in Windows. The generated files / directories should have the compression or encryption attribute set on some of them. The files as well as the directories should also have alternate data streams (ADS) associated with them. There should be a provision for updating certain parts of this randomly generated data. This kind of utility is very useful for testing file system backup products.

Ans: For a long time I was searching for a Unix-like “dd” command line utility for Windows. I came across this utility: Random Data File Creator (RDFC). It gives us “dd”-like functionality. Download this utility and keep it in C:\

Use the batch script below to generate random files and directories with compression / encryption attributes and alternate data streams associated with them.

@echo off
set testname=%0
set foldername=%1
set fext=%2
set size=%3
set unit=%4
set no=%5
set outerfolder=%6
set innerfolder=%7
set /A filenumber = 1

set dest=%foldername%
set str=%dest: =%
mkdir %str%
set /A a=0
:Loop1
	set dest1=%str%\%a%
	set str1=%dest1: =%
	mkdir %str1%
	set /A b=0
	:Loop2
		set dest2=%str1%\%b%
		set str2=%dest2: =%
		mkdir %str2%
		set /A c=0
		:Loop3
			set fname=%str2%\file-%filenumber%.%fext%
			set tmpfname=%str2%\file_tmp
			c:\rdfc.exe %tmpfname% %size% %unit% overwrite
			type nul >> %fname%
			:: append data to existing file
			type %tmpfname% >> %fname%
			:: create alternate data stream
			set /A adsproperty = %filenumber% %% 2
			IF %adsproperty%==1 (type %tmpfname% >> %fname%:my_file_ads)
			del %tmpfname%
			set /A filenumber = %filenumber% + 1
			set /A c=%c%+1
			IF %c%==%no% (goto end3) ELSE (goto loop3)
		:end3
		rem *******************
		:: set compression / encryption attribute
		set /A dirproperty = %b% %% 3
		IF %dirproperty%==1 (compact /C /S:%str2%)
		IF %dirproperty%==2 (cipher /E /S:%str2%)
		set tmpfname=%str1%\file_tmp
		c:\rdfc.exe %tmpfname% %size% %unit% overwrite
		set /A adsproperty = %b% %% 2
		IF %adsproperty%==1 (type %tmpfname% >> %str2%:my_dir_ads)
		del %tmpfname%
		set /A b=%b%+1
		IF %b%==%innerfolder% (goto end2) ELSE (goto loop2)
	:end2
	rem *******************
	set /A a=%a%+1
	IF %a%==%outerfolder% (goto end1) ELSE (goto loop1)
:end1

:: usage random_files.bat foldername EXTENSION SIZE UNIT[B|kB|MB|GB] NO_OF_FILES no_outerfolders no_innerfolders

Usage:

C:\>random_files.bat D:\test bin 1 kB 2 3 3

This will generate the D:\test folder. Inside this folder 3 subfolders are created, numbered 0, 1, 2. Inside each of these, three more subfolders are created, numbered 0, 1, 2. Inside each of those inner folders, 2 random 1 kB files with the .bin extension are created. Starting from inner folder 1, every third inner folder and its contents are compressed. Starting from inner folder 2, every third inner folder and its contents are encrypted. Every second file created has an alternate data stream (ADS) associated with it, and so does every second inner folder.

C:\>random_files.bat D:\test bin 512 B 4 3 1

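Running this command a second time touches only the 0’th inner folder of each of the three outer folders and writes 4 files of 512 bytes there. Where a file with the same name already exists, the new data is appended: for example file-1.bin and file-2.bin in D:\test\0\0 now have size 1.5 kB. The remaining files are created as new 512-byte files. Because file numbers run continuously across folders, which existing files get updated depends on the counts used in each run.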

You can have any permutation to generate random data and keep on updating that data with any combination 🙂

 

Windows how to find disk number of mounted VHD?

Q. Once you mount a VHD on Windows, how do you find the disk number ‘n’ of the mounted VHD? This is required to access the mounted VHD as a disk device, \\.\PhysicalDriveN.

Ans:

Use the diskpart command below; it lists all the mounted VHDs and their corresponding disk numbers.

DISKPART> list vdisk

  VDisk ###  Disk ###  State                 Type        File
  ---------  --------  --------------------  ----------  ----
  VDisk 0    Disk 2    Attached not open     Expandable  S:\UBMD.vhd
  VDisk 1    Disk 1    Attached not open     Expandable  C:\resOCB\test_0.vhd

From the example S:\UBMD.vhd is mounted at disk number 2 and C:\resOCB\test_0.vhd is mounted at disk number 1.