Hi all! I'm starting a new article series here. This one is going to be about Unix utilities that you should know about. The articles will discuss one Unix program at a time. I'll try to write a good introduction to the tool and give as many examples as I can think of.
The first post in this series is going to be about a not so well known but super powerful Unix program called Pipe Viewer or pv for short. Pipe viewer is a terminal-based tool for monitoring the progress of data through a pipeline. It can be inserted into any normal pipeline between two processes to give a visual indication of how quickly the data is passing through, how long it has taken, how near to completion it is, and an estimate of how long it will be until completion.
Pipe viewer was written by Andrew Wood, an experienced Unix sysadmin. The homepage of the pv utility is here: pv utility.
If you feel like you are interested in this stuff, I suggest that you subscribe to my rss feed to receive my future posts automatically.
How to use pv?
Let's start with some really easy examples and progress to more complicated ones.
Suppose that you have a file access.log that is tens of gigabytes in size and contains web logs. You want to compress it into a smaller file, let's say a gzip archive (.gz). The obvious way to do it is:
$ gzip -c access.log > access.log.gz
As the file is so huge (tens of gigabytes), you have no idea how long to wait. Will it finish soon? Or will it take another 30 mins?
By using pv you can see how far along it is and get an estimate of how long it will take:
$ pv access.log | gzip > access.log.gz
611MB 0:00:11 [58.3MB/s] [=> ] 15% ETA 0:00:59
Pipe viewer acts as cat here, except it also adds a progress bar. We can see that gzip processed 611MB of data in 11 seconds. It has processed 15% of all data and it will take 59 more seconds to finish. So no coffee break.
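The claim that pv behaves like cat can be checked with a small sketch. The sample file below is a made-up stand-in for access.log, and the fallback function is my own addition so that the script still runs on a machine without pv (you just lose the progress bar).

```shell
# Fall back to cat when pv is not installed: same data, no progress bar.
if ! command -v pv >/dev/null 2>&1; then
    pv() { cat "$@"; }
fi

# Hypothetical sample file standing in for access.log.
log_file=$(mktemp)
gz_file=$(mktemp)
i=0
while [ "$i" -lt 1000 ]; do
    echo "127.0.0.1 - - GET /index.html 200"
    i=$((i + 1))
done > "$log_file"

# Same pipeline as above, on the sample file.
pv "$log_file" | gzip > "$gz_file"

# Decompress and compare with the original, byte for byte.
if gzip -dc "$gz_file" | cmp -s - "$log_file"; then
    roundtrip=ok
else
    roundtrip=failed
fi
echo "round trip: $roundtrip"
```

If the round trip reports ok, pv passed the data through untouched, which is exactly what cat would have done.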
You can stick several pv processes in your pipeline. For example, you can time how fast the data is being read from the disk with one pv and how much data has been gzipped via a second pv:
$ pv -cN source access.log | gzip | pv -cN gzip > access.log.gz
source: 760MB 0:00:15 [37.4MB/s] [=> ] 19% ETA 0:01:02
gzip: 34.5MB 0:00:15 [1.74MB/s] [ <=> ]
Here we have specified the -N parameter to pv to create a named stream. The -c parameter makes sure the output is not garbled by one pv process writing over the other.
This example shows that the access.log file is being read at the speed of 37.4MB/s but gzip is writing data at only 1.74MB/s. We can immediately calculate the compression ratio. It's 37.4/1.74, or about 21x!
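The same arithmetic can be done on the command line. This is just a sketch using the two rates from the pv output above; the variable names are my own. Keep in mind that these are instantaneous rates, so the result is only a rough estimate of the final ratio.

```shell
# Rates copied from the pv output above, in MB/s.
read_rate=37.4
write_rate=1.74

# Divide input rate by output rate to estimate the compression ratio.
ratio=$(awk -v r="$read_rate" -v w="$write_rate" 'BEGIN { printf "%.1f", r / w }')
echo "approximate compression ratio: ${ratio}x"
```

For the real ratio after the job finishes, compare the sizes of access.log and access.log.gz instead.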
Notice how the gzip line doesn't show how much data is left or how soon it will finish. That's because the pv process after gzip has no idea how much data gzip will produce (it's just passing along compressed data from its input stream). The first pv process, however, knows how much data is left, because it's reading from a file of known size.
Another similar example is to pack a whole directory of files into a compressed tarball:
$ tar -czf - . | pv > out.tgz
117MB 0:00:55 [2.7MB/s] [> ]
In this example, pv only shows the output rate of the tar -czf command. It has no information about how big the directory is, how long the tar process will run, or how much data is left. We need to provide the total size of the data we are tarring to pv. It can be done this way:
$ tar -cf - . | pv -s $(du -sb . | awk '{print $1}') | gzip > out.tgz
253MB 0:00:05 [46.7MB/s] [> ] 1% ETA 0:04:49
What happens here is that we tell tar to create (-c argument) an archive of all files in the current dir (. argument), recursing into subdirectories (the default mode), and to write the data to stdout (-f - argument). Next, we pass the total size of all files in the current dir and all its subdirectories to pv via the -s argument. The du -sb . | awk '{print $1}' command returns the number of bytes in the current dir, and it's fed as the -s parameter to pv. Next, we gzip the content and write the result to the out.tgz file. This way pv knows how much data is still left to be processed and shows us that it will take another 4 mins 49 secs to finish. So you can take a quick coffee break.
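One caveat: the -b flag of du is a GNU extension, and the number it returns is not exactly the size of the tar stream (tar adds headers and padding). Here is a sketch of an alternative that measures the tar stream itself with wc -c. This is my own variation, not something pv requires, and it reads the directory one extra time. The demo directory and file names are made up, and the cat fallback is only there so the script runs without pv.

```shell
# Fall back to cat when pv is missing (no progress display in that case).
if ! command -v pv >/dev/null 2>&1; then
    pv() { cat; }
fi

# Hypothetical demo directory with a little sample data.
demo_dir=$(mktemp -d)
out_file=$(mktemp)
printf 'hello pv\n' > "$demo_dir/a.txt"
head -c 100000 /dev/zero > "$demo_dir/b.bin"

# Exact size of the tar stream in bytes (costs one extra pass over the data).
size=$(tar -cf - -C "$demo_dir" . | wc -c | tr -d ' ')

# Same pipeline as above, with the measured size passed to pv -s.
tar -cf - -C "$demo_dir" . | pv -s "$size" | gzip > "$out_file"

echo "tar stream size: $size bytes"
```

For a directory of tens of gigabytes the extra pass is expensive, so the du approach is usually the better trade-off when it's available.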
Another interesting example is copying large amounts of data over the network via the nc (netcat) utility that I will write about some other time.
(Update: Just wrote about it: Netcat – A Unix Utility You Should Know About.)
Suppose you have two computers A and B. You want to transfer a directory from A to B very quickly. The fastest way to do it is to use tar and nc, and time the operation with pv.
On computer A with IP address 192.168.1.100 run this command:
$ tar -cf - /path/to/dir | pv | nc -l -p 6666 -q 5
On computer B run this command:
$ nc 192.168.1.100 6666 | pv | tar -xf -
That's it! All the files in /path/to/dir on computer A will get transferred to computer B, and you'll be able to see how fast the operation is going.
This will show how fast the data is being transferred but it won't show how much data is left. If you want this information, then you have to do the pv -s $(...) trick from the previous example and add it to pv on computer A.
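Put together, the sender side could look something like the sketch below. The function names dir_size_bytes and send_dir are hypothetical helpers of mine, and the nc flags are simply copied from the command above. The fallback to du -sk is an assumption for systems whose du has no -b flag; it reports disk usage in 1024-byte units, so the total is only approximate there.

```shell
# Hypothetical helper: total size of a directory in bytes.
# Uses GNU du -sb when available, otherwise approximates with POSIX du -sk.
dir_size_bytes() {
    if size_line=$(du -sb "$1" 2>/dev/null); then
        echo "$size_line" | awk '{print $1}'
    else
        kb=$(du -sk "$1" | awk '{print $1}')
        echo $((kb * 1024))
    fi
}

# Hypothetical sender for computer A. It is only defined here, not run,
# because nc would sit and wait for computer B to connect.
send_dir() {
    tar -cf - "$1" | pv -s "$(dir_size_bytes "$1")" | nc -l -p 6666 -q 5
}
```

On computer A you would then run send_dir /path/to/dir, and the command on computer B stays the same.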
Here's another fun example. It shows how fast the computer reads from /dev/zero:
$ pv /dev/zero > /dev/null
157GB 0:00:38 [4.17GB/s]
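That command runs until you interrupt it. Here is a sketch of a bounded variant that stops by itself: dd reads a fixed 16 MiB from /dev/zero (the amount is an arbitrary choice of mine), and wc -c confirms that everything arrived on the other side of pv. As before, the cat fallback is only there so the script runs without pv.

```shell
# Fall back to cat when pv is missing (no progress display in that case).
if ! command -v pv >/dev/null 2>&1; then
    pv() { cat; }
fi

# Read exactly 16 blocks of 1 MiB from /dev/zero, pass them through pv,
# and count the bytes that come out the other end.
bytes=$(dd if=/dev/zero bs=1048576 count=16 2>/dev/null | pv | wc -c | tr -d ' ')
echo "bytes passed through pv: $bytes"
```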
That's it. I hope you enjoyed this post and learned something new. I love explaining things and teaching!
How to install pv?
If you're on Debian or a Debian-based system such as Ubuntu, do the following:
$ sudo aptitude install pv
If you're on Fedora or a Fedora-based system such as CentOS, do:
$ sudo yum install pv
If you're on Mint, do:
$ sudo apt-get install pv
If you're on Slackware, go to the pv homepage, download the pv-version.tar.gz archive and do:
$ tar -zxf pv-version.tar.gz
$ cd pv-version
$ ./configure && sudo make install
If you're a Mac user:
$ sudo port install pv
If you're an OpenSolaris user:
$ pfexec pkg install pv
If you're a Windows user on Cygwin:
$ ./configure
$ export DESTDIR=/cygdrive/c/cygwin
$ make
$ make install
The manual of the utility can be found here: man pv.
Have fun measuring your pipes with pv and until next time!