Note how the first 'p4 verify' command succeeds, but the second fails with a P4PASSWD error, like it's somehow losing the ticket.
As reported by the customer, the behavior is NOT consistent: sometimes it fails, other times it does not. What does seem consistent is that when it fails, it is always the second command called by the script (the '-qS' pass verifying shelved files rather than submitted files), never the first '-qz' pass.
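For context, the two passes per depot boil down to something like this (a minimal sketch, not the actual p4verify.ps1 code; the depot loop and variable names are illustrative):

$depots = p4 -ztag -F "%name%" depots
foreach ($depot in $depots) {
    p4 verify -qz "//$depot/..."    # pass 1: submitted archive files
    p4 verify -qS "//$depot/..."    # pass 2: shelved files
}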
The 'shelved files' angle may be a red herring, as a P4PASSWD error indicates a ticket authentication failure before the server even knows we are dealing with shelved files. But could it be that the subsequent call from the same script is somehow not seeing the P4TICKETS setting?
This line in p4verify.ps1 seems likely to be involved:
It is consistent in which of the two commands it fails on, but not in which days it fails or which depot in the list it fails at. We sometimes get fully successful verifies, other times errors as shown.
How long does the initial verify take? And how long is your normal ticket?
It might just be that a login is required before the shelf verify, or at least a check that the ticket is still valid.
Oh, I'm going to feel silly if it's a simple ticket expiration issue. I was assuming that had been considered earlier; I'd be surprised if it was overlooked. (I vaguely recall it was, as that's the first thing we would have looked at -- this issue has been under investigation for a while now.)
The long-running nature of verifies could still be a factor. I wonder if another possibility is ticket stomping (as opposed to ticket expiration), with p4login.bat perhaps being called by some other script while the verify is running? That could explain the semi-random occurrence of this issue.
There have been enhancements to the Linux 'p4login' script to avoid ticket stomping that weren't ported to p4login.bat. p4login.bat always runs a 'p4 login' command and generates a new ticket whether or not one is needed, while 'p4login' only generates a ticket if there isn't already a ticket with a long timeout (more than 30 days) available.
In theory this shouldn't matter, since a new valid ticket should "just work" anyway, so p4login.bat should be OK. But come to think of it, I haven't heard of any ticket stomping issues on Linux for years, and it was a few years ago that I put the logic into p4login to avoid unnecessary ticket generation. A new ticket shouldn't "stomp" an old one, but maybe it somehow does?
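For reference, the check the Linux p4login does could be ported to the Windows side along these lines (a rough sketch only, not actual SDP code; the 30-day threshold mirrors the description above and the output parsing is simplified):

# Only generate a new ticket if we don't already have one with plenty of life left.
$status = (p4 login -s) -join ' '
$needLogin = $true
if ($LASTEXITCODE -eq 0 -and $status -match 'expires in (\d+) hours') {
    if ([int]$Matches[1] -gt 720) { $needLogin = $false }   # more than 30 days remaining
}
if ($needLogin) {
    p4 login   # generate a fresh ticket (password handling omitted in this sketch)
}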
We had a successful verify on Sunday that took 5 hours 47 minutes. Our verify uses a super user which is in a group giving unlimited login time. Checking the ticket for this user on the server doing the verify shows it has 157714 hours remaining on its ticket.
It is part of our super users group as well, which has the default value of 12 hours for its login timeout, but my understanding was that users get the longest timeout of the groups they are in. Still, verifies only take 6 hours or so, half of the default 12-hour timeout from our super group if that were at play.
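For reference, the relevant settings can be checked with 'p4 group -o'; the group names below are placeholders, not our actual groups:

p4 group -o verify_group    # Timeout: unlimited  (no ticket expiry)
p4 group -o super_group     # Timeout: 43200      (the 12-hour default, in seconds)

The server applies the longest Timeout among all groups a user belongs to, so the unlimited group should win here.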
We start our daily checkpoint at 10 pm every day, and we start the verify at 11 pm each Wednesday, Saturday, and Sunday. The checkpoint/daily backup takes about 40 minutes to complete.
Another thing worth mentioning is that after the first error occurs, every depot thereafter in the verify gets the same error for both commands. So once it's tripped, it stays that way for the rest of the verify. I've attached one of the failed logs.
Also, this does not flag the verify as a failure, so we don't get notified in the email subject line.
Curious - is it possible some other command is fiddling with the ticket? Even if it is just regenerating the ticket via login (because, as Tom says, p4login.bat doesn't currently check).
Be careful due to the potential size of the data, but you could use Procmon from Sysinternals and filter it to only look at events for the P4TICKETS file - leave it running over the time the verify runs?
And/or insert a debug statement in the .ps1 to log the result of "p4 login -s" with appropriate parameters before each shelf check, etc.
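Something like the following, dropped in just before the shelf pass, would do it (the log path is a placeholder, and any port/user parameters should match the script's own variables, which aren't shown here):

# Hypothetical debug line: record ticket status with a timestamp before each shelf verify.
Add-Content "C:\p4\logs\ticket_debug.log" ("{0}  {1}" -f (Get-Date), ((p4 login -s 2>&1) -join ' '))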
We have a verify task running tonight. I'll set up Procmon to monitor our tickets file on the server and see what it returns.
In playing with that a bit, I noticed that python.exe is accessing our tickets file, p4ticket.txt. We have some triggers configured to send a JSON payload to Jenkins instances when submits happen to certain depots, and I noticed that the last failure, on 01/16/2020, started at 4:28 am. The previous successful verify command was at 4:12 am.
There was a changelist submitted to one of the trigger-monitored paths at 4:19 am the same day, between the last successful verify command and the first failure. I don't have Helix logs to prove the trigger ran, but the changelist was on one of the paths our triggers monitor, so I'm assuming it did. We'd turned down our logging because the file sizes were getting out of hand.
This Python script just sends a webhook and doesn't modify the tickets file. I've attached it for perusal. It just logs in to get some information about the submitted changelist, to decide whether we should send the webhook, and then fires it off. Minimal interaction with Helix here.
We wanted to report back that we found the failures were consistently occurring when this trigger runs. We modified the script to not log in and simply assume it will have a ticket, since it uses a user that has no timeout. We are no longer getting the error we saw before.
Our question, though, is why weren't we having this problem on the previous version of the SDP?
Just an FYI for @robert_cowham.