CopyPaste.Net: Always on transfer process monitoring

There are a lot of articles and blog posts about Always-on monitoring, I want to give a short description and explanation about the main metrics to monitor.

Metric	Unit	Description	Value on Primary	Value on Secondary	sys.dm_hadr_database_replica_states column
Log Send Queue Size	KB	Amount of log records of the primary database that has not been sent to the secondary databases, in kilobytes (KB).	NULL	Amount of log records of the primary database that has not been sent to the secondary DB.	log_send_queue_size
Redo Queue Size	KB	Amount of log records in the log files of the secondary replica that has not yet been redone, in kilobytes (KB).	NULL	Amount of log records in the log files of the secondary replica that has not yet been redone	redo_queue_size
Last Sent Time	datetime	Time when the last log block was sent.	NULL	The last time that the primary sent a Log Block.	last_sent_time
Last Received Time	datetime	Time when the log block ID in last message received was read on the secondary replica.	NULL	Time when the log block ID in last message received was read.	last_received_time
Last Hardened Time	datetime	On a secondary database, time of the log-block identifier for the last hardened LSN (last_hardened_lsn). On a primary database, reflects the time corresponding to minimum hardened LSN.	NULL	The time of the log-block identifier for the last hardened LSN (last_hardened_lsn).	last_hardened_time
Last Redone Time	datetime	Time when the last log record was redone on the secondary database.	NULL	This is the time that the last LSN was redone on the target database.	last_redone_time
Last Commit Time	datetime	Time corresponding to the last commit record.	The time of the last commit on the primary.	The time of the last commit record was redone and reported back to the primary. → The last commit time on your secondary replica will always be equal to or less than the primary replica.	last_commit_time
Received Latency	HH:MM:SS	The length of time between the time the last log block was sent and the Received Time.	NULL	The length of time between the time the last log block was sent and the Received Time.	IF last_sent_time IS NOT NULL: DATEDIFF(last_sent_time, last_received_time) ELSE NULL
Hardened Latency	HH:MM:SS	The length of time between the time the last log block was sent and the Hardened Time.	NULL	The length of time between the time the last log block was sent and the Hardened Time.	IF last_sent_time IS NOT NULL: DATEDIFF(last_sent_time, last_hardened_time) ELSE NULL
Redone Latency	HH:MM:SS	The length of time between the time the last log block was sent and the Redone Time.	NULL	The length of time between the time the last log block was sent and the Redone Time.	IF last_sent_time IS NOT NULL: DATEDIFF(last_sent_time, last_redone_time) ELSE NULL
Commit Latency	HH:MM:SS	The length of time between the time the last log block was sent and the Commit Time.	NULL	The length of time between the time the last log block was sent and the Commit Time.	IF last_sent_time IS NOT NULL: DATEDIFF(last_sent_time, last_commit_time) ELSE NULL
Transfer data rates (KB/sec)	KB/sec	Transfer data rates for the selected database	NULL	Transfer data rates.	log_send_rate & redo_rate
Queues size (KB)	KB	Queues size for the selected database.	NULL	Logs sizes - see above.	log_send_queue_size & redo_queue_size

Recovery Objectives – RTO & RPO

RTO - Recovery Time Objective

Estimated failover time / Estimated recovery time
How much time until we can get going again?
“It is the objective for the amount of time between the data loss event and recovery.”
We want to minimize the amount of time that we lose.
RTO will count for the longest redo time for all databases in the group.

isnull(datediff(ss,-- the time difference of:

replica.last_commit_time,-- last commit on the replica

primary.last_commit_time)-- last commit on the primary

,0) AS RTO

RPO - Recovery Point Objective

Estimated data loss
How much data we lose?
We want to minimize the amount of data that we lose.

SELECT isnull(

case hars.redo_rate

WHEN 0 THEN 0
ELSE CAST(hars.redo_queue_size AS float) / hars.redo_rate

END ,0) AS PRO

FROM sys.dm_hadr_database_replica_states hars

Another important terms:

SLA - Service-Level Agreement - The SLA should define the RPO and RTO.

Latency - how long it takes for a packet of data to traverse the network from source to destination.

Bandwidth - how much data can be moved in a time interval.

CopyPaste.Net

Labels

Always on transfer process monitoring

No comments:

Post a Comment