SDP-341 #10

# The form data below was edited by tom_tyler
# Perforce Workshop Jobs
#
#  Job:           The job name. 'new' generates a sequenced job number.
#
#  Status:        Job status; required field.  There is no enforced or
#                 promoted workflow for transition of jobs from one
#                 status to another, just a set of job status values
#                 for users to apply as they see fit.  Possible values:
#
#                 open - Issue is available to be worked on.
#
#                 inprogress - Active development is in progress.
#
#                 blocked - Issue cannot be implemented for some reason.
#
#                 fixed - Fixed, optional status to use before closed.
#                 
#                 closed - Issue has been dealt with definitively.
#
#                 punted - Decision made not to address the issue,
#                    possibly not ever.
#
#                 suspended - Decision made not to address the issue
#                    in the immediate future, but noting that it may
#                    have some merit and may be revisited later.
#
#                 duplicate - Duplicate of another issue.
#
#                 obsolete - The need behind the request has become
#                    overcome by events.
#
#  Project:       The project this job is for. Required.
#
#  Severity:      [A/B/C] (A is highest)  Required.
#
#  ReportedBy:    The user who created the job. Can be changed.
#
#  ReportedDate:  The date the job was created.  Automatic.
#
#  ModifiedBy:    The user who last modified this job. Automatic.
#
#  ModifiedDate:  The date this job was last modified. Automatic.
#
#  OwnedBy:       The owner, responsible for doing the job. Optional.
#
#  Description:   Description of the job.  Required.
#
#  DevNotes:      Developer's comments.  Optional.  Can be used to
#                 explain a status, e.g. for blocked, punted,
#                 obsolete or duplicate jobs.  May also provide
#                 additional information such as the earliest release
#                 in which a bug is known to exist.
#
#  Component:     Projects may use this optional field to indicate
#                 which component of the project a given job is associated
#                 with.
#
#                 For the SDP, the list of components is defined in:
#                 //guest/perforce_software/sdp/tools/components.txt
#
#  Type:          Type of job [Bug/Feature].  Required.
#
#  Release:       Release in which job is intended to be fixed.

Job:	SDP-341

Status:	closed

Project:	perforce-software-sdp

Severity:	A

ReportedBy:	cgeen

ReportedDate:	2018/07/10 15:03:11

ModifiedBy:	tom_tyler

ModifiedDate:	2019/01/23 08:28:29

OwnedBy:	tom_tyler

Description:
	Critical recreate_db_checkpoint.sh bug with shared /hxdepots.
	
	This bug won't impact many customers because it requires an unlikely
	sequence of events. But the impact is high if it hits, and addressing
	the issue is critical.
	
	== Background ==
	
	This bug has been in the SDP since 2016.2.21193 (December 2, 2016),
	and affects versions between 2016.2.21193 (December 2, 2016) and
	2018.1.23583 (2018/02/08), inclusive.  Versions older or newer are
	unaffected, and the Windows SDP is unaffected.
	
	The issue is in the recreate_db_checkpoint.sh script.  The default SDP
	crontab calls this script only twice a year, and the call is sometimes
	disabled entirely because it is optional.  The script replaces live databases in
	P4ROOT with fresh, regenerated-from-a-checkpoint databases from the
	offline_db tree maintained by the SDP.
	
	The default crontab calls recreate_db_checkpoint.sh twice per year, on
	the first Saturday in January and July at 6:05 PM in the master
	server's time zone.
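	
	Note that cron cannot express "first Saturday" directly: when both
	the day-of-month and day-of-week fields are restricted, a standard
	crontab entry fires when either one matches.  The date test therefore
	has to happen in the command itself.  The following is only a sketch
	of such a test; the function name is hypothetical and is not taken
	from the SDP.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of the SDP: succeeds only when the given
# date is the first Saturday of January or July.
# Arguments: month (01-12), day of month (01-31), ISO weekday (1=Mon .. 7=Sun).
is_first_sat_jan_jul() {
    month=${1#0}      # strip one leading zero, e.g. "07" -> "7"
    day=${2#0}
    weekday=$3
    case "$month" in
        1|7) ;;                       # January or July
        *) return 1 ;;
    esac
    [ "$weekday" -eq 6 ] || return 1  # Saturday
    [ "$day" -le 7 ]                  # within the first seven days
}

# Example: test today's date using POSIX date format specifiers.
if is_first_sat_jan_jul "$(date +%m)" "$(date +%d)" "$(date +%u)"; then
    echo "Today matches the default recreate_db_checkpoint.sh schedule."
fi
```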
	
	The issue only occurs when the following are true:
	* The storage volume used for archive files is shared (e.g. via NFS
	or SAN) across a master and its HA server.
	* A failover from the master server to the HA replica has been done.
	* The recreate_db_checkpoint.sh script runs accidentally on the
	out-of-commission master (e.g. via a cron that everyone forgot about
	still running on the old master).
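	
	The third condition can be checked ahead of time by looking for
	active crontab entries that still call the script.  A minimal sketch,
	assuming the crontab text is supplied on standard input; the function
	name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper, NOT part of the SDP: read crontab text on stdin and
# print the active (non-comment) lines that call recreate_db_checkpoint.sh.
# Exit status is 0 when at least one such line is found, non-zero otherwise.
find_recreate_calls() {
    grep -v '^[[:space:]]*#' | grep 'recreate_db_checkpoint\.sh'
}

# Typical use, run as the Perforce OS account on every server machine,
# including a master that has been failed away from:
#   crontab -l | find_recreate_calls && echo "WARNING: script is still scheduled"
```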
	
	The negative impact occurs after a failover-then-failback situation,
	when the script is run on the old master, but (due to shared storage)
	rotates database symlinks on the new master server machine.
	
	It is not likely to hit many customers, but when it does, the impact
	is an outage and the need to recover from a checkpoint and journal.
	(Luckily, those are always available with the SDP).
	
	=== A QUICK FIX ===
	
	Customers should DELETE these two scripts from the installation:
	
	/p4/common/bin/recreate_db_checkpoint.sh
	/p4/common/bin/recreate_db_sync_replica.sh
	
	Then remove any calls to these two scripts from the crontab of the
	OS account under which Perforce runs on any and all Perforce server
	machines.  This OS account is typically 'perforce' or 'p4admin'.
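	
	The steps above might be scripted roughly as follows.  This is a
	sketch only: the function name and the backup file names are
	hypothetical, and the filter drops every crontab line that mentions
	either script, including lines that are already commented out.
	Review the filtered crontab before installing it.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of the SDP: copy crontab text from stdin to
# stdout, dropping every line that mentions either of the two scripts.
strip_recreate_calls() {
    # grep -v exits 1 when it prints nothing; treat that as success too.
    grep -v -E 'recreate_db_(checkpoint|sync_replica)\.sh' || [ $? -eq 1 ]
}

# Suggested sequence, run as the Perforce OS account on each server machine
# (left commented out so nothing destructive runs by accident):
#   crontab -l > crontab.backup
#   strip_recreate_calls < crontab.backup > crontab.new
#   crontab crontab.new
#   rm -f /p4/common/bin/recreate_db_checkpoint.sh \
#         /p4/common/bin/recreate_db_sync_replica.sh
```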
	
	If you are not comfortable with the SDP, this is a fast, safe, easy
	fix.  It only requires login access to the machine and OS file
	permissions sufficient to delete the scripts.  It can be applied
	immediately by anyone with login access to the Perforce server machine.
	It does NOT require an SDP update.
	
	After making this change, the HA replica that shares archives with
	its master server must be reseeded from the latest checkpoint on
	the master.
	
	This quick fix will remove the capability to occasionally replace live
	databases with fresh ones regenerated from a checkpoint.  That
	functionality is non-critical to most customers.
	
	=== THE QUICK SDP PATCH ===
	
	A quick SDP patch has been released that simply deletes this script
	and references to it in the crontab and documentation.  (A fixed
	version of the script will likely re-appear in a future release).
	
	=== A BETTER, MORE SOPHISTICATED FIX ===
	
	For customers who want to preserve the capability to routinely replace
	live databases with fresh ones regenerated from a checkpoint, a
	workaround can be done by making a change to the SDP structure rather
	than deleting the recreate_db_checkpoint.sh script.
	
	The solution outlined below has been proven to work.  If you are
	comfortable with the SDP, this is the best fix.
	
	Details:  Since the early days of the SDP in 2007, it has been
	structured so that the /p4 directory was on the root volume (/), and
	the individual SDP instance-specific directories, e.g. /p4/1, were on
	the storage volume used for archive files (often named /hxdepots or
	/depotdata, but can be different at any given customer site).  The
	instance-specific directories contained a mix of regular directories
	(for things stored on the archive files volume) and symlinks.
	
	To fix this issue, restructure the SDP so that the /p4 directory and
	instance-specific directories like /p4/1 are ALL on the root
	volume (/).  The instance-specific directories contain only
	symlinks and .p4tickets/.p4trust files in this structure.
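	
	One way to verify the resulting layout is sketched below.  The
	function names are hypothetical, the check assumes the root volume is
	mounted at / and that mount points contain no spaces, and it does not
	inspect the contents of the instance directory.

```shell
#!/bin/sh
# Hypothetical helpers, NOT part of the SDP.

# Succeeds only for a real directory, not a symlink to one.
is_real_dir() {
    [ -d "$1" ] && [ ! -L "$1" ]
}

# Print the mount point of the filesystem holding the given path,
# taken from the last column of POSIX-format df output.
mount_point_of() {
    df -P "$1" | awk 'NR == 2 { print $6 }'
}

# Report whether a directory such as /p4 or /p4/1 is a real directory
# that lives on the root volume.
check_instance_layout() {
    if ! is_real_dir "$1"; then
        echo "$1: not a regular directory (symlink or missing)"
        return 1
    fi
    if [ "$(mount_point_of "$1")" != "/" ]; then
        echo "$1: not on the root volume"
        return 1
    fi
    echo "$1: OK"
}

# Example use for instance 1:
#   check_instance_layout /p4 && check_instance_layout /p4/1
```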
	
	This fix can be applied manually, and does not require an SDP
	upgrade.  Further, it will work with future versions of the
	SDP, as this structural change to the symlink and directory
	structure of /p4 and /p4/N directories was on track to be
	included in a future release of the SDP for performance reasons
	prior to detection of this bug.  (The performance benefit is
	ensuring that access to latency-sensitive /p4/N/root does
	not pay a high latency tax going through a /p4/N symlink on a
	shared storage volume).
	
	=== FUTURE FIX ===
	
	A future SDP release will provide a fix that preserves the
	capability to routinely replace live databases with fresh ones
	regenerated from a checkpoint.  Customers will need to update to
	the latest SDP to get the new version when it is available.

Component:	core-unix

Type:	Bug