$ cat /p4/1/logs/checkpoint.log
Fri Oct 30 05:07:00 UTC 2020 /p4/common/bin/daily_checkpoint.sh: Start p4_1 Checkpoint
Fri Oct 30 05:07:01 UTC 2020 /p4/common/bin/daily_checkpoint.sh: Offline journal number is: 22
Fri Oct 30 05:07:01 UTC 2020 /p4/common/bin/daily_checkpoint.sh: Skipping call to truncate_journal() on edge or replica server.
Fri Oct 30 05:07:01 UTC 2020 /p4/common/bin/daily_checkpoint.sh: Replay any unreplayed journals to the offline database.
Fri Oct 30 05:07:01 UTC 2020 /p4/common/bin/daily_checkpoint.sh: Replay journal /depotdata/p4/1/checkpoints.replica.1/p4_1.replica.1.jnl.22 to offline db.
Recovering from /depotdata/p4/1/checkpoints.replica.1/p4_1.replica.1.jnl.22...
real 0m0.027s
user 0m0.004s
sys 0m0.013s
Fri Oct 30 05:07:01 UTC 2020 /p4/common/bin/daily_checkpoint.sh: Replay journal /depotdata/p4/1/checkpoints.replica.1/p4_1.replica.1.jnl.23 to offline db.
Recovering from /depotdata/p4/1/checkpoints.replica.1/p4_1.replica.1.jnl.23...
real 0m0.023s
user 0m0.004s
sys 0m0.012s
Fri Oct 30 05:07:01 UTC 2020 /p4/common/bin/daily_checkpoint.sh: Replay journal /depotdata/p4/1/checkpoints.replica.1/p4_1.replica.1.jnl.24 to offline db.
Recovering from /depotdata/p4/1/checkpoints.replica.1/p4_1.replica.1.jnl.24...
Perforce server error:
open for read: /depotdata/p4/1/checkpoints.replica.1/p4_1.replica.1.jnl.24: No such file or directory
open for read: /depotdata/p4/1/checkpoints.replica.1/p4_1.replica.1.jnl.24: No such file or directory
real 0m0.007s
user 0m0.003s
sys 0m0.004s
Fri Oct 30 05:07:01 UTC 2020 /p4/common/bin/daily_checkpoint.sh: ERROR!!! - replica p4_1 /p4/common/bin/daily_checkpoint.sh: Offline journal replay failed. Abort!
$ ls -al /p4/1/checkpoints.replica.1/
total 40
drwx------. 2 p4 p4 267 Oct 30 05:08 .
drwx------. 6 p4 p4 81 Oct 30 03:44 ..
-r--r-----. 1 p4 p4 2838 Oct 30 04:54 p4_1.replica.1.jnl.15
-r--r-----. 1 p4 p4 4126 Oct 30 04:55 p4_1.replica.1.jnl.16
-r--r-----. 1 p4 p4 1940 Oct 30 04:55 p4_1.replica.1.jnl.17
-r--r-----. 1 p4 p4 1940 Oct 30 04:55 p4_1.replica.1.jnl.18
-r--r-----. 1 p4 p4 1940 Oct 30 04:55 p4_1.replica.1.jnl.19
-r--r-----. 1 p4 p4 2838 Oct 30 04:56 p4_1.replica.1.jnl.20
-r--r-----. 1 p4 p4 2838 Oct 30 04:57 p4_1.replica.1.jnl.21
-r--r-----. 1 p4 p4 3228 Oct 30 05:06 p4_1.replica.1.jnl.22
-r--r-----. 1 p4 p4 1940 Oct 30 05:06 p4_1.replica.1.jnl.23
The daily_checkpoint.sh script isn't intended to work on a forwarding replica; it should only be run on a master or an edge server. I set the Component field of this job to 'doc' to add the needed clarification. Here are some bits:
First, none of the SDP scripts interfere with 'p4d pull' real-time replication. The SDP sets standards and conventions for how replication is set up, but none of the scripts do anything to a replica once it is running. (Well, except for load_checkpoint.sh, which blasts and reseeds.)
For forwarding replicas, there are a few scripts you might choose, depending on your goals. You may want to use either sync_replica.sh or replica_cleanup.sh, and optionally request_replica_checkpoint.sh.
sync_replica.sh: This keeps the offline_db on the replica in sync with the master by rsyncing checkpoints from the master and replaying them into the offline_db, as well as doing various and sundry tasks like log rotation, compression, and cleanup. Since it pulls checkpoints taken on the master, it is only appropriate if the replica is not filtered in any way. As an alternative, the replica_cleanup.sh script skips the rsync and the offline_db maintenance, and just does the cleanup.
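To illustrate the idea, here is a rough sketch of what sync_replica.sh does (a hedged approximation, not the actual script; 'master' and the journal number NNN are placeholders, and paths assume a standard SDP instance 1 layout):

# Pull checkpoints from the master, then rebuild offline_db from the latest one.
rsync -a master:/p4/1/checkpoints/ /p4/1/checkpoints/
# offline_db must be empty before replaying a checkpoint into it:
rm -f /p4/1/offline_db/db.*
p4d_1 -r /p4/1/offline_db -z -jr /p4/1/checkpoints/p4_1.ckp.NNN.gz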
There is also request_replica_checkpoint.sh. That's pretty much just a wrapper around the 'p4 admin checkpoint -Z' command, which causes the replica to execute a checkpoint on the next journal rotation detected from the replica's P4TARGET server. This is ideal for forwarding replicas that are filtered, and it can also be used for the moral equivalent of a live checkpoint on an edge server (when combined with rotate_journal.sh run on the master, to trigger the checkpoint on the edge when you want it).
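For example (a hedged sketch; assuming the SDP scripts take the instance number as their argument, here for instance 1):

# On the replica, schedule a checkpoint for the next journal rotation:
/p4/common/bin/request_replica_checkpoint.sh 1
# ...which is essentially a wrapper for running this against the replica:
p4 admin checkpoint -Z
# On the master, rotate the journal so the replica detects the rotation and checkpoints:
/p4/common/bin/rotate_journal.sh 1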
Note that taking checkpoints of unfiltered forwarding replicas is NOT recommended. If you want to offload checkpoint creation from the master, use a standby replica instead. Even that isn't the first choice, though: checkpoints taken with daily_checkpoint.sh against the master's offline_db use the simplest possible mechanism, and are thus the most reliable. That said, taking checkpoints on standby replicas is a viable alternative. Also note that the 'p4 failover' command only supports failing over to replicas of type 'standby' or 'forwarding-standby'.
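For reference, failing over to a standby looks something like this (a hedged example; assumes a 2019.1+ p4d, with 'standby-host:1666' as a placeholder address):

# Dry run first -- without -y, 'p4 failover' only reports what it would do:
p4 -p standby-host:1666 failover
# Then perform the actual failover:
p4 -p standby-host:1666 failover -y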
Unfiltered replicas (with a Services value of 'replica' or 'forwarding-replica') don't really need checkpoints. Filtered replicas or filtered forwarding replicas may be worth checkpointing, as they have a data set different from the master's, albeit a mere subset of it.
I'll also make a change so that daily_checkpoint.sh generates an error immediately if run on a server type that it wasn't intended to run on.
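Something along these lines (a hypothetical sketch of the guard, not the actual SDP change; it assumes $P4ROOT/server.id holds the ServerID, per SDP convention):

# Abort unless this server's Services type is one the script supports.
SERVERID=$(cat "$P4ROOT/server.id")
SERVICES=$(p4 -ztag -F %Services% server -o "$SERVERID")
case "$SERVICES" in
    standard|commit-server|edge-server) ;;  # master or edge: OK to proceed
    *) echo "Error: daily_checkpoint.sh is not supported on a '$SERVICES' server. Abort!"
       exit 1 ;;
esac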
Thanks for the detailed explanation.
So would the checkpoint created by using request_replica_checkpoint.sh be different from the checkpoint generated by adding support to the SDP script to replay the filtered replica's journals to its offline_db and creating a checkpoint from that?
The drawback I see with requesting a checkpoint using 'p4 admin checkpoint -Z' is that if we have a large filtered replica server, replaying journals and creating the checkpoint could take a significant amount of time. During this time users would not be able to use the server until the process finishes, right?
That's correct! And if your forwarding replica is filtered and actively used, I can see why you'd want daily_checkpoint.sh to run on it -- to have local offline checkpoints of the filtered replica's data set.
I can think of two things that might help:
1. Come up with a variation on the sync_replica.sh theme that works on filtered replicas.
2. Make daily_checkpoint.sh "just work" on a filtered replica (per the original request).
Since a filtered replica is a strict subset of the master, you could always create a new seed checkpoint from the master's data set, using something like:
p4d_1 -r /p4/1/offline_db -J off -z -P FilteredReplicaServerID -jd /p4/1/checkpoints/p4_1.ckp.FilteredReplicaServerID.NNN.gz
where NNN is found by something like:
p4d_1 -r /p4/1/offline_db -k db.counters -jd - | grep '@db.counters@ @journal@' | cut -d '@' -f 8
So, Option 1 would be based on that. The advantage is that a new seed checkpoint taken from the master's offline_db would be the most reliable.
Option 2 would be replaying local (filtered) journals into the local offline_db and generating local offline checkpoints. That should work, but it's a copy-of-a-copy thing, and might suffer some fidelity loss in certain (rare) situations.
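In rough terms, Option 2 would boil down to something like this on the replica (a hedged sketch with placeholder journal/checkpoint number NNN, reusing the paths from the log above):

# Replay a rotated replica journal into the local offline_db:
p4d_1 -r /p4/1/offline_db -jr /p4/1/checkpoints.replica.1/p4_1.replica.1.jnl.NNN
# Dump a compressed checkpoint of the filtered data set from the offline_db:
p4d_1 -r /p4/1/offline_db -J off -z -jd /p4/1/checkpoints.replica.1/p4_1.replica.1.ckp.NNN.gz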
So, I changed this component from 'doc' back to 'core-unix' (though with implied doc changes needed), since this would take a code change to implement, now that I understand the use case more fully. I also changed it from 'Bug' to 'Feature': the current script intentionally does not work for filtered replicas; adding support for that is something new.
Sounds good. I went ahead and updated backup_functions.sh to support replicas. This has been tested on several servers and works as expected.
https://swarm.workshop.perforce.com/reviews/26875
Thanks!