Data Profiler
Produce a complete data quality assessment for the source system data. The output is a data quality scorecard that drives cleansing priorities and determines which data issues must be fixed before migration and which can be resolved during transformation. No migration should proceed to production cutover without a completed data profile.
Profiling Queries
Run these queries against the source system. Document the results for every entity being migrated.
Completeness Analysis
-- Null rate per column (SQL Server)
SELECT
'Clients' AS TableName,
SUM(CASE WHEN ClientID IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS ClientID_NullPct,
SUM(CASE WHEN LastName IS NULL OR LastName = '' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS LastName_NullPct,
SUM(CASE WHEN FirstName IS NULL OR FirstName = '' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS FirstName_NullPct,
SUM(CASE WHEN SSN IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS SSN_NullPct,
SUM(CASE WHEN DateOfBirth IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS DateOfBirth_NullPct,
SUM(CASE WHEN AgentCode IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS AgentCode_NullPct,
COUNT(*) AS TotalRecords
FROM dbo.Clients;
Null rate interpretation:
| Null Rate | Severity | Action |
|---|---|---|
| 0% | None | No action needed |
| 1-5% | Low | Accept if field is optional in destination; investigate if required |
| 5-20% | Medium | Identify pattern (batch of records from specific period?), cleanse or default |
| > 20% | High | Must resolve before migration — either cleanse, derive, or confirm field is not required |
Document results:
| Table | Column | Null / Empty Count | Null % | Destination Required | Action |
|---|---|---|---|---|---|
| Clients | LastName | 0 | 0% | Yes | No action |
| Clients | DateOfBirth | 1,245 | 10% | No | Accept null — destination field is optional |
| Clients | AgentCode | 340 | 2.7% | No | Set ProducerId to null in destination |
| Clients | SSN | 8,200 | 66% | No (encrypted, optional) | Accept null |
Uniqueness Analysis
-- Duplicate detection on natural key fields
SELECT
PolicyNumber,
COUNT(*) AS DuplicateCount
FROM dbo.Policies
GROUP BY PolicyNumber
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
-- Total duplicates
SELECT COUNT(*) AS TotalDuplicateGroups
FROM (
SELECT PolicyNumber, COUNT(*) cnt
FROM dbo.Policies
GROUP BY PolicyNumber
HAVING COUNT(*) > 1
) d;
-- Sample of duplicate records (to understand the pattern)
SELECT TOP 20 p.*
FROM dbo.Policies p
WHERE p.PolicyNumber IN (
SELECT PolicyNumber FROM dbo.Policies GROUP BY PolicyNumber HAVING COUNT(*) > 1
)
ORDER BY p.PolicyNumber, p.CreateDate;
Document:
| Table | Key Field | Total Records | Duplicate Groups | Records in Duplicate Groups | Duplicate % |
|---|---|---|---|---|---|
| Policies | PolicyNumber | 38,200 | 42 | 89 | 0.23% |
| Clients | SSN (non-null) | 4,085 | 7 | 16 | 0.39% |
De-duplication strategy (choose based on the pattern found):
| Strategy | When to Use |
|---|---|
| Keep latest by date | Duplicates are updates that did not properly replace the original |
| Keep highest status | Active > Pending > Cancelled — the more complete record wins |
| Merge fields | Both records have partial data that completes the full record — merge fields |
| Manual review | Duplicates cannot be programmatically resolved — flag for business review |
Format Consistency
-- Date format distribution (for varchar date columns)
SELECT
CASE
WHEN EffectiveDate LIKE '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]' THEN 'MM/DD/YYYY'
WHEN EffectiveDate LIKE '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]' THEN 'YYYY-MM-DD'
WHEN EffectiveDate IS NULL THEN 'NULL'
ELSE 'OTHER: ' + LEFT(EffectiveDate, 20)
END AS FormatPattern,
COUNT(*) AS RecordCount
FROM dbo.OldPolicies -- tables with varchar date columns
GROUP BY
CASE
WHEN EffectiveDate LIKE '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]' THEN 'MM/DD/YYYY'
WHEN EffectiveDate LIKE '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]' THEN 'YYYY-MM-DD'
WHEN EffectiveDate IS NULL THEN 'NULL'
ELSE 'OTHER: ' + LEFT(EffectiveDate, 20)
END;
-- Phone number format distribution
SELECT
CASE
WHEN Phone LIKE '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]' THEN 'NNN-NNN-NNNN'
WHEN Phone LIKE '([0-9][0-9][0-9]) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]' THEN '(NNN) NNN-NNNN'
WHEN Phone LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]' THEN 'NNNNNNNNNN'
WHEN Phone IS NULL THEN 'NULL'
ELSE 'OTHER'
END AS FormatPattern,
COUNT(*) AS RecordCount
FROM dbo.Clients
GROUP BY
CASE
WHEN Phone LIKE '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]' THEN 'NNN-NNN-NNNN'
WHEN Phone LIKE '([0-9][0-9][0-9]) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]' THEN '(NNN) NNN-NNNN'
WHEN Phone LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]' THEN 'NNNNNNNNNN'
WHEN Phone IS NULL THEN 'NULL'
ELSE 'OTHER'
END;
Document the findings and the normalization rule:
| Field | Formats Found | Count per Format | Normalization Rule |
|---|---|---|---|
| EffectiveDate | MM/DD/YYYY (35,100), YYYY-MM-DD (2,800), NULL (300) | — | Parse all formats, output ISO 8601 YYYY-MM-DD |
| Phone | NNN-NNN-NNNN (8,200), (NNN) NNN-NNNN (1,800), NNNNNNNNNN (950), NULL (1,500) | — | Strip all non-digits, store as 10-digit string |
SSN masking verification:
-- Verify SSN data is not exposed in other columns
SELECT TOP 10 Notes FROM dbo.Clients WHERE Notes LIKE '%[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]%';
-- If any rows returned: SSN embedded in notes — must be scrubbed before migration
Referential Integrity
-- Orphaned policies (no matching client)
SELECT
COUNT(*) AS OrphanedPolicies,
COUNT(*) * 100.0 / (SELECT COUNT(*) FROM dbo.Policies) AS OrphanPct
FROM dbo.Policies p
WHERE NOT EXISTS (SELECT 1 FROM dbo.Clients c WHERE c.ClientID = p.ClientID);
-- Show sample orphaned records
SELECT TOP 20 p.PolicyNumber, p.ClientID, p.CreateDate
FROM dbo.Policies p
WHERE NOT EXISTS (SELECT 1 FROM dbo.Clients c WHERE c.ClientID = p.ClientID)
ORDER BY p.CreateDate DESC;
-- Claims with no matching policy
SELECT COUNT(*) AS OrphanedClaims
FROM dbo.Claims cl
WHERE NOT EXISTS (SELECT 1 FROM dbo.Policies p WHERE p.PolicyID = cl.PolicyID);
Document:
| Parent Table | Child Table | Orphan Count | Orphan % | Recommended Action |
|---|---|---|---|---|
| Clients | Policies | 12 | 0.03% | Investigate — these are old records; likely client deleted without cascade. Exclude from migration or create placeholder client records. |
| Policies | Claims | 0 | 0% | No action |
| Agents | Policies | 7 | 0.02% | Agent left firm; create inactive placeholder agent record |
Value Distribution
For code fields (status codes, LOB codes, state codes), profile all distinct values:
-- Status code distribution
SELECT StatusCode, COUNT(*) AS RecordCount, COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() AS Pct
FROM dbo.Policies
GROUP BY StatusCode
ORDER BY RecordCount DESC;
Document and flag any codes not in the expected value set:
| Code Field | Value | Count | % | In Destination Lookup? |
|---|---|---|---|---|
| StatusCode | A | 28,450 | 74.5% | Yes → Active |
| StatusCode | C | 8,100 | 21.2% | Yes → Cancelled |
| StatusCode | X | 1,550 | 4.1% | Yes → Expired |
| StatusCode | P | 98 | 0.3% | Yes → Pending |
| StatusCode | Z | 2 | 0.01% | No — unknown code. Must investigate before migration. |
Unknown codes require business owner confirmation: What does Z mean? Map to which destination value? Or exclude from migration?
Data Volume by Entity
-- Record counts and approximate data volume
SELECT
t.name AS TableName,
SUM(p.rows) AS RowCount,
SUM(a.total_pages) * 8 / 1024.0 AS TotalMB
FROM sys.tables t
JOIN sys.indexes i ON t.object_id = i.object_id
JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.allocation_units a ON p.partition_id = a.container_id
WHERE t.type = 'U'
GROUP BY t.name
ORDER BY TotalMB DESC;
| Table | Row Count | Size (MB) | Est. Migration Time at 500 rows/sec | Priority |
|---|---|---|---|---|
| Policies | 38,200 | 145 MB | 76 seconds | 2 (after Clients) |
| Attachments | 28,000 files | 8.2 GB | File transfer will dominate — est. 2 hours | 3 |
| Clients | 12,450 | 42 MB | 25 seconds | 1 (first — referenced by Policies) |
| Claims | 4,100 | 18 MB | 8 seconds | 4 |
| Agents | 85 | < 1 MB | < 1 second | 0 (first — referenced by Clients) |
Critical Data Quality Issues
Identify fields that will fail destination validation without cleansing:
| Issue | Table | Field | Record Count | Severity | Must Fix Before Cutover |
|---|---|---|---|---|---|
| Blank LastName | Clients | LastName | 3 | Critical | Yes — destination requires |
| Unknown StatusCode 'Z' | Policies | StatusCode | 2 | High | Yes — no destination mapping |
| EffectiveDate in future | Policies | EffectiveDate | 890 | Low | No — valid for future policies |
| SSN in Notes field | Clients | Notes | 14 | Critical | Yes — PII in free text; scrub before migration |
| Orphaned policies | Policies | ClientID | 12 | Medium | Investigate — exclude or fix |
Data Quality Scorecard
Summarize findings per entity on a 0-100 score:
Scoring method:
- Completeness: (1 - average null rate for required fields) × 40 points
- Uniqueness: (1 - duplicate rate on natural key) × 30 points
- Referential integrity: (1 - orphan rate) × 20 points
- Format consistency: (% of records in standard format for date/phone) × 10 points
| Entity | Completeness | Uniqueness | Ref. Integrity | Format | Total Score | Grade |
|---|---|---|---|---|---|---|
| Clients | 38/40 | 29/30 | 20/20 | 9/10 | 96 | A |
| Policies | 36/40 | 28/30 | 19/20 | 7/10 | 90 | A- |
| Claims | 39/40 | 30/30 | 20/20 | 10/10 | 99 | A+ |
| Agents | 40/40 | 30/30 | 20/20 | 10/10 | 100 | A+ |
Migration readiness:
- Score >= 90: Ready for migration with minor cleansing
- Score 75-89: Cleansing required before migration; estimate 1-2 weeks of data prep
- Score < 75: Significant data quality work required; estimate 2-4 weeks; may require business involvement
Output Format
Deliver as:
- Profiling query results summary (one section per analysis type)
- Completeness table (null rates per column per entity)
- Uniqueness findings (duplicate counts and de-duplication strategy)
- Format consistency findings and normalization rules
- Referential integrity findings and orphan handling strategy
- Value distribution table for all code fields (with unknown values flagged)
- Data volume and migration time estimates
- Critical issues list (ranked by severity — must resolve before cutover)
- Data quality scorecard per entity
- Cleansing work estimate (number of records and effort to address each issue)
- Recommended cleansing sequence (order of operations to prepare data for migration)