Labels

admin (1) aix (1) alert (1) always-on (2) Architecture (1) aws (3) Azure (1) backup (3) BI-DWH (10) Binary (3) Boolean (1) C# (1) cache (1) casting (3) cdc (1) certificate (1) checks (1) cloud (3) cluster (1) cmd (7) collation (1) columns (1) compilation (1) configurations (7) Connection-String (2) connections (6) constraint (6) copypaste (2) cpu (2) csv (3) CTE (1) data-types (1) datetime (23) db (547) DB2 (1) deadlock (2) Denali (7) device (6) dotNet (5) dynamicSQL (11) email (5) encoding (1) encryption (4) errors (124) excel (1) ExecutionPlan (10) extended events (1) files (7) FIPS (1) foreign key (1) fragmentation (1) functions (1) GCP (2) gMSA (2) google (2) HADR (1) hashing (3) in-memory (1) index (3) indexedViews (2) insert (3) install (10) IO (1) isql (6) javascript (1) jobs (11) join (2) LDAP (2) LinkedServers (8) Linux (15) log (6) login (1) maintenance (3) mariadb (1) memory (4) merge (3) monitoring (4) MSA (2) mssql (444) mssql2005 (5) mssql2008R2 (20) mssql2012 (2) mysql (36) MySQL Shell (5) network (1) NoSQL (1) null (2) numeric (9) object-oriented (1) offline (1) openssl (1) Operating System (4) oracle (7) ORDBMS (1) ordering (2) Outer Apply (1) Outlook (1) page (1) parameters (2) partition (1) password (1) Performance (103) permissions (10) pivot (3) PLE (1) port (4) PostgreSQL (14) profiler (1) RDS (3) read (1) Replication (12) restore (4) root (1) RPO (1) RTO (1) SAP ASE (48) SAP RS (20) SCC (4) scema (1) script (8) security (10) segment (1) server (1) service broker (2) services (4) settings (75) SQL (74) SSAS (1) SSIS (19) SSL (8) SSMS (4) SSRS (6) storage (1) String (35) sybase (57) telnet (2) tempdb (1) Theory (2) tips (120) tools (3) training (1) transaction (6) trigger (2) Tuple (2) TVP (1) unix (8) users (3) vb.net (4) versioning (1) windows (14) xml (10) XSD (1) zip (1)

Always on transfer process monitoring

There are a lot of articles and blog posts about Always-on monitoring, I want to give a short description and explanation about the main metrics to monitor.


Metric

Unit

Description

Value on Primary

Value on Secondary

sys.dm_hadr_database_replica_states column

 

Log Send Queue Size

KB

Amount of log records of the primary database that has not been sent to the secondary databases, in kilobytes (KB).

NULL

Amount of log records of the primary database that has not been sent to the secondary DB.

log_send_queue_size

Redo Queue Size

KB

Amount of log records in the log files of the secondary replica that has not yet been redone, in kilobytes (KB).

NULL

Amount of log records in the log files of the secondary replica that has not yet been redone

redo_queue_size

Last Sent Time

datetime

Time when the last log block was sent.

NULL

The last time that the primary sent a Log Block.

last_sent_time

Last Received Time

datetime

Time when the log block ID in last message received was read on the secondary replica.

NULL

Time when the log block ID in last message received was read.

last_received_time

Last Hardened Time

datetime

On a secondary database, time of the log-block identifier for the last hardened LSN (last_hardened_lsn). On a primary database, reflects the time corresponding to minimum hardened LSN.

NULL

The time of the log-block identifier for the last hardened LSN (last_hardened_lsn).

last_hardened_time

Last Redone Time

datetime

Time when the last log record was redone on the secondary database.

NULL

This is the time that the last LSN was redone on the target database.

last_redone_time

Last Commit Time

datetime

Time corresponding to the last commit record.

The time of the last commit on the primary.

The time of the last commit record was redone and reported back to the primary.

→ The last commit time on your secondary replica will always be equal to or less than the primary replica.

last_commit_time

Received Latency

HH:MM:SS

The length of time between the time the last log block was sent and the Received Time.

NULL

The length of time between the time the last log block was sent and the Received Time.

IF last_sent_time IS NOT NULL:

DATEDIFF(last_sent_time, last_received_time)

ELSE

NULL

Hardened Latency

HH:MM:SS

The length of time between the time the last log block was sent and the Hardened Time.

NULL

The length of time between the time the last log block was sent and the Hardened Time.

IF last_sent_time IS NOT NULL:

DATEDIFF(last_sent_time, last_hardened_time)

ELSE

NULL

Redone Latency

HH:MM:SS

The length of time between the time the last log block was sent and the Redone Time.

NULL

The length of time between the time the last log block was sent and the Redone Time.

IF last_sent_time IS NOT NULL:

DATEDIFF(last_sent_time, last_redone_time)

ELSE

NULL

Commit Latency

HH:MM:SS

The length of time between the time the last log block was sent and the Commit Time.

NULL

The length of time between the time the last log block was sent and the Commit Time.

IF last_sent_time IS NOT NULL:

DATEDIFF(last_sent_time, last_commit_time)

ELSE

NULL

Transfer data rates (KB/sec)

KB/sec

Transfer data rates for the selected database

NULL

Transfer data rates.

log_send_rate

&

redo_rate

Queues size (KB)

KB

Queues size for the selected database.

NULL

Logs sizes - see above.

log_send_queue_size

&

redo_queue_size


 Recovery Objectives – RTO & RPO

RTO - Recovery Time Objective

  • Estimated failover time / Estimated recovery time
  • How much time until we can get going again?
  • “It is the objective for the amount of time between the data loss event and recovery.”
  • We want to minimize the amount of time that we lose.
  • RTO will count for the longest redo time for all databases in the group.
isnull(datediff(ss,-- the time difference of:
replica.last_commit_time,-- last commit on the replica
primary.last_commit_time)-- last commit on the primary
,0) AS RTO

RPO - Recovery Point Objective

  • Estimated data loss
  • How much data we lose?
  • We want to minimize the amount of data that we lose.
SELECT isnull(
case hars.redo_rate 
WHEN 0 THEN 0 
ELSE CAST(hars.redo_queue_size AS float) / hars.redo_rate 
END ,0) AS PRO
FROM sys.dm_hadr_database_replica_states hars




Another important terms:

SLA - Service-Level Agreement - The SLA should define the RPO and RTO.

Latency - how long it takes for a packet of data to traverse the network from source to destination.

Bandwidth - how much data can be moved in a time interval.