24769 | New script for performing a parallel checkpoint. Run as follows: ... parallel_ckp.sh <instance> -P <threads>. New script to restore a parallel checkpoint file to the offline database in case a recovery is needed. Run as follows: parallel_ckp_restore.sh <instance> -f <parallel_ckp_file.tgz> -P <threads>
25374 | New script for performing a parallel checkpoint. Run as follows: ...; parallel_ckp.sh <instance> -P <threads>. New script to restore a parallel checkpoint file to the offline database in case a recovery is needed. Run as follows: parallel_ckp_restore.sh <instance> -f <parallel_ckp_file.tgz> -P <threads>
24768 | New script for performing a parallel checkpoint. Run as follows: ... parallel_ckp.sh <instance> -P <threads>. New script to restore a parallel checkpoint file to the offline database in case a recovery is needed. Run as follows: parallel_ckp_restore.sh <instance> -f <parallel_ckp_file.tgz> -P <threads> #review-24769
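For quick reference, a hypothetical invocation of both new scripts; the instance name ("1"), the thread count (4), and the archive name are placeholders, not prescribed values:

    # Hypothetical example; instance, thread count, and archive name are placeholders.
    parallel_ckp.sh 1 -P 4
    parallel_ckp_restore.sh 1 -f parallel_ckp_file.tgz -P 4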
Testing Robert's suggestion of using pigz. Pretty neat utility. Trying that approach where I do a normal jd and pipe it through pigz for the large database files, and bundle the remaining ones. (whattttttt, this Swarm doesn't support markdown in comments? 😕)

I can't recommend this solution, since checkpoints are run on the master and you will hog too many of the server's CPUs.
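A minimal sketch of the pigz-piping approach described above. It assumes p4d will write a single table's dump to stdout when given "-" as the checkpoint file (an assumption inferred from the piping described here, not a documented guarantee); the root path, the "large" table list, and the thread count are placeholders.

    #!/bin/bash
    # Sketch only: dump the large tables individually, compressing each through
    # pigz, then bundle the remaining db.* files. Assumes p4d accepts '-' to
    # write the dump to stdout; paths, table names, and thread count are placeholders.
    P4ROOT=/p4/1/offline_db        # placeholder; an offline/quiesced copy, since raw db.* files are bundled below
    THREADS=4
    LARGE_TABLES="db.have db.integed db.rev"

    for table in $LARGE_TABLES; do
        p4d -r "$P4ROOT" -jd - "$table" | pigz -p "$THREADS" > "ckp.${table}.gz"
    done

    # Bundle the remaining, smaller db.* files into a single archive.
    ( cd "$P4ROOT" && ls db.* | grep -v -E 'have|integed|rev' | tar -czf "$OLDPWD/ckp.small.tgz" -T - )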
Good point. The proposed logic includes determining the number of CPUs to engage and using as many as possible, essentially assuming that "getting it over with" is a higher priority than "not significantly impacting performance while checkpoints occur." Both are reasonable priorities. We'll want to add a throttle control to set the minimum number of CPUs to keep available for other processing, to avoid hitting CPU availability too hard. Given the high value of reducing checkpoint duration, I think most admins will tolerate some impact (compared to the current single-threaded behavior), so long as it's not blocking/locking or exhausting resources. But we'll need to make it super easy to "go back to the old way." Or perhaps make it so a configuration change from the default is needed to enable parallel checkpoint processing? (I'd be inclined to make it the default, though, since we're only going to release it once fully proven.)
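A minimal sketch of the throttle idea, assuming the reserve is expressed as a number of CPUs to leave free; RESERVE_CPUS and its default of 2 are placeholders, not an agreed configurable:

    # Hypothetical throttle: engage all CPUs except a configurable reserve.
    RESERVE_CPUS=${RESERVE_CPUS:-2}       # placeholder default
    TOTAL_CPUS=$(nproc)
    THREADS=$(( TOTAL_CPUS - RESERVE_CPUS ))
    (( THREADS < 1 )) && THREADS=1        # never drop below one worker
    echo "Using $THREADS of $TOTAL_CPUS CPUs for the parallel checkpoint"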
I'm going to bump here to show some serious interest in this (on Windows). Our checkpoints just take far too long to complete. I'd like to note that we take our checkpoints on read-only replicas. In a federated environment, I care a little less about my server being hammered hard than about my ability to take frequent backups that let me restore to a point in time closer to any failure we have.
Last time I worked on this during a hackathon, I discovered that a checkpointing process takes a lock on the database itself, so other checkpoint processes couldn't run, even if you specified a single individual db.* file. The only way I'm aware of to work around this would involve symlinking the database files from discrete locations, as described by Google: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/39983.pdf
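A sketch of what that symlinked layout could look like, purely as an illustration of the idea referenced above; the volume paths and P4ROOT are placeholders:

    # Hypothetical layout: keep db.* files on separate volumes and symlink them
    # into P4ROOT so each can be handled from its own device. Paths are placeholders.
    P4ROOT=/p4/1/root
    for vol in /vol1 /vol2 /vol3; do
        for f in "$vol"/p4db/db.*; do
            [ -e "$f" ] && ln -sfn "$f" "$P4ROOT/$(basename "$f")"
        done
    done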
With the right kind of hardware, doing a jd without -z, piped through pigz, did make checkpointing individual databases quicker and allowed the system to take advantage of spare resources (noting that even on a 16-core box I never saw pigz reliably gobble more than 5-6 cores). pigz is capable of constraining how many cores it's allowed to consume. While parallel checkpoint creation wasn't particularly quick, having multiple checkpoint files would allow for a parallel restore of said checkpoint.
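A sketch of what such a parallel restore could look like, assuming one gzipped per-table checkpoint file (named ckp.db.<table>.gz here) and that each file can be replayed independently into the offline root; the naming, the -P count, and the independence assumption are all assumptions, not a supported procedure:

    # Hypothetical parallel restore of per-table checkpoint files into an
    # offline database root; file naming and safe concurrency are assumptions.
    P4ROOT=/p4/1/offline_db
    ls ckp.db.*.gz | xargs -P 4 -I{} p4d -r "$P4ROOT" -z -jr {}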
For us personally, we've moved full checkpointing to a replica, process journal rotations on the commit server, and do database rebalancing on the commit server on the weekends.
That is correct, Alan, and that is the reason we use the offline database for the checkpoint.