README.p4watch #1

*** These notes are a sad excuse for "documentation", but are aimed at
*** getting this to a potential user quickly.  As time permits, I'll
*** try to improve "p4watch", add some genuine documentaiton, and
*** better packaging.

     - rmg 9/19/2001 

=== p4watch Notes ===

"p4watch" is a tool for tracking down Perforce performance problems,
when you suspect that something is periodically causing Perforce
commands which _should_ be quick to execute to get "the slows".

It works by periodically running a (nominally) short-running "p4"
command, ("p4 describe 1", by default), and watching for the command
to complete.  If the command has not completed given timeout duration,
then a "p4wd" is done, in order to log system state.

After "p4wd" has been run, it will also be run again after the
original "p4" probe command finally completes. (This makes it easy to
look for other p4d activity which was active at the time the probe
command timed out, but which is gone after the probe command finally
completed; these are sometimes good suspects for what caused the
slowness.

Optionally, a "vmstat" on the server host and a "sysstat" on the filer
(each of which produce periodic output) can be requested. When these
are used, p4watch captures periodic vmstst and sysstat output for the
entire duration of p4watch.

All output from all of this goes to the standard output, and the ttt
logfile (see below), with labeling and timestamping that help to see
what happened when (or to extract particular information of interest,
usging grep).

"p4watch" uses a library called "ttt". In particular, you'll need:

  tttLib.pl	           - the main ttt library
  tttAgents.pl          - the ttt agent library
  ttt_generic_opts.pl   - generic options for the ttt library
  
These should reside in the same directory a the "p4watch" script.
(For now, at least)

"p4watch" also requires the Time::HiRes perl module (available from
CPAN).

Finally, you'll need a working "p4wd" on your p4d host.

If you choose to enable the vmstat logging, the "vmstat" command
issued may need to be adjusted for the operating system you are
running your Perforce server on.

"p4watch" can be run from any host or as any user, provided that:

   - you'll need to run on a host (and as a user) from which you can
     run "rsh <p4dhost> p4wd". (I.e., you can run p4watch itself
     directly on your server host, or some other host)

   - if you enable the "sysstatint" option (filer sysstat output), you'll
     need to run "p4watch" from a host (and as a user) that
     can run "rsh <filer> sysstat".

   - if you enable the "vmstatint" option (p4d host sysstat logging),
     you'll need to run "p4watch" from a host (and as a user) that
     can run "rsh <p4dhost> vmstat".
     
All "p4watch" arguments can be set either on the command line,
using, e.g.,

  p4watch p4wdthresh=10 srv_host=charley user=rmg p4cmdint=120

or in the environment, e.g.,

  srv_host=charley; export srv_host
  p4cmdint=120; export p4cmdint
  p4watch

I like to use a shell script to start p4watch, e.g.:

| #!/bin/sh
| 
| p4watch \
|   filer_user=rmg:rmg1pwd \
|   logfile=log \
|   p4cmd=files \
|   p4cmdint=10 \
|   p4wdthresh=5 \
|   sysstatint=3 \
|   vmstatint=3 \
|   srv_port=1678 \
|   srv_host=localhost \
|   p4wd="p4wd -p4d_pid 5068 -p4d_log /home/p4/root-cvs2p4/log 1000; top b n 1"

To get a brief description of the arguments (and their defaults),
use

  p4watch =options

Here's a bit more about the options:

  p4cmd              the short running p4 command to run as a probe

  p4wd	             the p4wd command to execute.
     (You may need to supply p4wd arguments to fit your server,
     as shown in the example above; also not that you can tack on other
     commands to be run after p4wd - a "top b n 1" in the example above).

  p4wdthresh         threshold above which to run p4wd
     i.e., if the "$p4dcmd" doesn't finish or yeild output for
     this many seconds, run a p4wd.

  vmstatint          vmstat interval argument
     setting this causes "vmstat <vmstatint>" to be run on the p4d host

  sysstatint         filer sysstat interval argument
     setting this causes "sysstat <sysstatint> to be run on the filer

  p4cmdint           p4 describe interval
     this is the frequency at which "p4 describe" probes should be run

  srv_host           host p4d is running on
     The host where "p4d" is running

  srv_port           port p4d is listening on
     The port on srv_host where p4d is listening

  srv_user           user to run as on srv_host
     all command on srv_host will be run with "rsh -l <srv_user> ..."

  cli_host           host to run "p4 describe" probes on
     The host where the "p4 describes" should be run

  cli_user           user to run as on cli_host
     all command on cli_host will be run with "rsh -l <cli_user> ..."

  filer              the filer hostname
     
  filer_user         user to run as on the filer
     all command on the filer will be run with "rsh -l <filer_user> ..."

I suggest that, before attempting to run "p4watch", try running
"testAgents"; the output should look like

| rmg $ testAgents
|         : logging to: /home/rmg/web/richard_geiger/guest/richard_geiger/p4bench/ttt_log
| execute : 14498 on peggyo: date
| execute : 14499 on peggyo: date
| execute : 14500 on peggyo: date
| 14500   : STDOUT: Thu Sep 13 17:58:01 PDT 2001
| 14498   : STDOUT: Thu Sep 13 17:58:01 PDT 2001
| 14499   : STDOUT: Thu Sep 13 17:58:01 PDT 2001
| 14500   : STDERR: EOF
| 14500   : STDOUT: EOF
| 14498   : STDERR: EOF
| 14498   : STDOUT: EOF
| 14499   : STDERR: EOF
| 14499   : STDOUT: EOF
| 14498   : exit status = 0
| 14499   : exit status = 0
| 14500   : exit status = 0
| exit    : pass

And then, perhaps another one, asking to run "uname" on the p4dhost
("computer", in this example):

| rmg $ testAgents onhost1=computer oncmd1=hostname
|         : logging to: /home/rmg/web/richard_geiger/guest/richard_geiger/p4bench/ttt_log
| execute : 5871 on computer: hostname
| execute : 5872 on peggyo: date
| execute : 5873 on peggyo: date
| 5872    : STDOUT: Wed Sep 19 15:09:49 PDT 2001
| 5873    : STDOUT: Wed Sep 19 15:09:49 PDT 2001
| 5872    : STDERR: EOF
| 5872    : STDOUT: EOF
| 5873    : STDERR: EOF
| 5873    : STDOUT: EOF
| 5871    : STDOUT: computer.perforce.com
| 5871    : STDERR: EOF
| 5871    : STDOUT: EOF
| 5871    : exit status = 0
| 5872    : exit status = 0
| 5873    : exit status = 0
| exit    : pass

(This will verify that the "ttt" library is working right; I think it
should work fine on most Unixes, but if not, please let me know.)

For an example p4watch session, see EXAMPLE_p4watch_log.

*** Note: to make sense of the following, I suggest using a really wide window!:

Entries in the log file look like

  2001-09-19 14:26:26: execute : 5646 on localhost: uname -a
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^ message
                       ^^^^^^^^^ tag
  ^^^^^^^^^^^^^^^^^^^ timestamp

The "tag"s serve to identiy line, since they are logged in time order,
and sometimes this creates interleaved stuff you want to pull apart.
Try, e.g.,

  grep ": vmstat  :" EXAMPLE_p4watch_log

Additionally, lines that log output from child processes have,

2001-09-19 14:26:26: 5646    : STDOUT: Linux peggyo 2.4.2-2 #1 Sun Apr 8 20:41:30 EDT 2001 i686 unknown
			       ^^^^^^^ stream id

Note that the "tag" for these lines can either be the pid of the
process that produced the output (as above), or a symbolic name,
assigned when when the process was started, e.g.,

2001-09-19 14:26:26: vmstat  : STDOUT:    procs                      memory    swap          io     system         cpu
                     ^^^^^^

p4watch uses the tag ": trip   :" when slowness is detected:

2001-09-19 14:26:50: sysstat : STDOUT:   2%     0     0     0       0     0     0     0     0     0     0   >60  100%   0%  -   0%
2001-09-19 14:26:53: vmstat  : STDOUT:  0  0  0   2564   4996  28796  68904   0   0     0     0  111    56   0   0 100
2001-09-19 14:26:53: sysstat : STDOUT:   2%     0     0     0       0     0     0     0     0     0     0   >60  100%   0%  -   0%
2001-09-19 14:26:54: trip    : do p4wd after waiting 5.993265 seconds
2001-09-19 14:26:54: execute : 5656 on localhost: p4wd -p4d_pid 5068 -p4d_log /home/p4/root-cvs2p4/log 1000; top b n 1
2001-09-19 14:26:54: p4wd    : STDOUT:  5068*5068  3981 p4d              Wed Sep 19 13:35:18 2001  768 00:00:00 p4d -p 1678 
                               ^^p4wd output^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                 
  Note: in this example, the server is very unbusy, so you only see
  the "master" p4d process in the p4wd output above. (In fact I caused
  the timeout in this example to trip by stopping the server with
  job-control!)

2001-09-19 14:26:55: p4wd    : STDOUT: 
2001-09-19 14:26:55: p4wd    : STDOUT: 
2001-09-19 14:26:55: p4wd    : STDOUT:   2:26pm  up 42 days,  5:20,  2 users,  load average: 0.00, 0.00, 0.00
2001-09-19 14:26:55: p4wd    : STDOUT: 84 processes: 78 sleeping, 1 running, 1 zombie, 4 stopped
#	Change	User	Description	Committed
#1	987	Richard Geiger	Add p4watch, despite it's wholly unproductized state.