DédupData
Automated Python script for person deduplication, connecting all data sources of a national health insurance mutual.
National mutual insurance company
A national-scale health insurance mutual with several hundred thousand members, managing data flows from multiple systems: DMS, Cegedim management software, Zendesk support platform, and Azure databases. Over the years, thousands of duplicate person records had accumulated in the information system, created through human error by the management teams.
Management teams were inadvertently creating duplicate person records in the information system. With no automated way to merge them, these duplicates accumulated for years across multiple systems.
What we observed
Duplicates created by human error
Operators created new records instead of finding existing ones, introducing name variations, typos, and inconsistent address formats.
Siloed data
DMS, Cegedim, Zendesk, Azure database: each system had its own copy of person records, with no synchronization or reconciliation.
No single person view
It was impossible to reconstruct a member's complete history. Information was fragmented across multiple records.
GDPR risk
Each duplicate was an extra copy of personal data stored without a legitimate purpose, which increased exposure to GDPR non-compliance risk.
Our solution
An automated cross-system merge process
Duplicate listing as input
The process receives as input a listing of persons identified as duplicates. Identification is performed upstream by business teams or the mutual's internal tools.
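As an illustration, listing ingestion and validation could look like the sketch below. The CSV layout (columns `source_id` and `target_id`), the `DuplicatePair` class, and the function name `read_duplicate_listing` are assumptions made for this example, not the mutual's actual file format.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class DuplicatePair:
    source_id: str  # record to be merged away (hypothetical column name)
    target_id: str  # surviving record (hypothetical column name)


def read_duplicate_listing(handle):
    """Parse a duplicate listing into valid pairs and rejected rows."""
    reader = csv.DictReader(handle)
    required = {"source_id", "target_id"}
    if not required.issubset(reader.fieldnames or []):
        raise ValueError(f"listing must contain columns {sorted(required)}")

    pairs, rejected = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        source = (row["source_id"] or "").strip()
        target = (row["target_id"] or "").strip()
        if not source or not target:
            rejected.append((line_no, "missing identifier"))
        elif source == target:
            rejected.append((line_no, "record listed as its own duplicate"))
        else:
            pairs.append(DuplicatePair(source, target))
    return pairs, rejected


# Small in-memory listing: one valid pair and two rows that fail validation.
sample = io.StringIO("source_id,target_id\nP002,P001\nP003,\nP004,P004\n")
pairs, rejected = read_duplicate_listing(sample)
```

Rejected rows are returned rather than silently dropped, so they can be sent back to the business teams that produced the listing.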
Multisource connection
A Python script connects simultaneously to the DMS, Cegedim, Zendesk, Azure SQL Database, and the rest of the mutual's databases.
Automated cross-system merging
For each identified duplicate, the script automatically propagates the merge across all connected systems: updating references, consolidating histories, and removing redundant records.
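The propagation step can be pictured with the minimal sketch below. `InMemorySystem` is a hypothetical stand-in for a real connector (which would call an API or run SQL), and the consolidation rule shown here, keeping the target's values and filling only its gaps, is an assumption; actual business rules may differ.

```python
from dataclasses import dataclass, field


@dataclass
class InMemorySystem:
    """Stand-in for one connected system (hypothetical)."""
    name: str
    persons: dict = field(default_factory=dict)     # person_id -> attributes
    references: dict = field(default_factory=dict)  # item_id -> person_id

    def merge(self, source_id, target_id):
        """Merge source into target; return True if this system changed."""
        if source_id not in self.persons:
            return False  # this system holds no copy of the duplicate
        source = self.persons.pop(source_id)  # remove the redundant record
        target = self.persons.setdefault(target_id, {})
        for key, value in source.items():
            target.setdefault(key, value)  # keep target values, fill gaps
        for item_id, person_id in self.references.items():
            if person_id == source_id:
                # Re-point history items to the surviving record.
                self.references[item_id] = target_id
        return True


def propagate_merge(systems, source_id, target_id):
    """Apply one merge to every system; list the systems that were impacted."""
    return [s.name for s in systems if s.merge(source_id, target_id)]


support = InMemorySystem(
    "support",
    {"P001": {"email": "a@example.org"}, "P002": {"phone": "0100000000"}},
    {"ticket-1": "P002", "ticket-2": "P001"},
)
documents = InMemorySystem("documents", {"P001": {}}, {"doc-9": "P001"})
impacted = propagate_merge([support, documents], source_id="P002", target_id="P001")
```

Returning the list of impacted systems gives the reporting step the information it needs without each connector having to know about logging.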
Complete traceability
Each merge operation is logged with a detailed report: source records, target record, impacted systems, and consolidated data. A complete history for audit and compliance.
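One possible shape for such a log entry is sketched below. The `MergeReport` structure, its field names, and the JSON-lines output are assumptions chosen for illustration; they mirror the four elements listed above plus a timestamp.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class MergeReport:
    source_ids: list          # records merged away
    target_id: str            # surviving record
    impacted_systems: list    # systems where the merge changed something
    consolidated_fields: dict # data carried over to the target record
    merged_at: str            # UTC timestamp, ISO 8601


def build_report(source_ids, target_id, impacted_systems, consolidated_fields):
    """Assemble the audit entry for one completed merge."""
    return MergeReport(
        source_ids=list(source_ids),
        target_id=target_id,
        impacted_systems=list(impacted_systems),
        consolidated_fields=dict(consolidated_fields),
        merged_at=datetime.now(timezone.utc).isoformat(),
    )


def to_log_line(report):
    """Serialize a report as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(report), sort_keys=True)


report = build_report(["P002"], "P001", ["support"], {"phone": "0100000000"})
line = to_log_line(report)
```

Writing one self-contained line per merge keeps the log easy to append to and easy to filter during an audit.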
From a listing of duplicates identified upstream, the Python script connects to all systems and propagates the merge to produce a clean, reconciled database.
From duplicate listing to the unified database
The process receives as input a listing of persons identified as duplicates by business teams. It then connects to all of the mutual's systems to propagate the merge: updating references, consolidating histories, and removing redundant records.
- Listing ingestion — reading the duplicate file identified upstream, validating the format and identifiers
- Multisource connection — simultaneous access to the DMS, Cegedim, Zendesk, Azure SQL, and all of the mutual's databases
- Cross-system merging — for each duplicate, propagation of the merge across all systems: data consolidation, cross-reference updates
- Traceability and reports — detailed log of each completed merge, summary report for audit and compliance
Listing reception
Ingestion of the duplicate file identified upstream
System connection
Simultaneous access to 5 data sources
Reference resolution
Identification of linked records in each system
Cross-system merging
Data consolidation, removal of redundant records
Report and traceability
Log of each merge, summary report
Before/after comparison of data quality at a national mutual insurance company.
What changes with automated deduplication
What our clients ask us
How does the deduplication process work?
Duplicate identification is performed upstream by business teams or the mutual's internal tools. Our process receives this listing of duplicate persons as input, then connects to all systems (DMS, Cegedim, Zendesk, Azure SQL) to automatically propagate the merge across each data source.
What data sources can you connect?
The Python script connects to any source exposing an API, a SQL database, files (CSV, Excel, XML), or a standard protocol. For this mutual: DMS (REST API), Cegedim (SQL database), Zendesk (REST API), Azure SQL Database, and file extractions from legacy systems.
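A common connector interface is one way to handle such varied sources, as in the sketch below. The `PersonSource` protocol, the class names, and the table and column names are all hypothetical; `sqlite3` stands in for a production SQL driver, and an in-memory CSV stands in for a legacy file extraction.

```python
import csv
import io
import sqlite3
from typing import Protocol


class PersonSource(Protocol):
    """Hypothetical interface that every connector would implement."""
    name: str

    def linked_records(self, person_id: str) -> list: ...


class SqlSource:
    """SQL-backed source; sqlite3 is only a stand-in for the real driver."""

    def __init__(self, name, connection):
        self.name = name
        self.connection = connection

    def linked_records(self, person_id):
        rows = self.connection.execute(
            "SELECT contract_id FROM contracts "
            "WHERE person_id = ? ORDER BY contract_id",
            (person_id,),  # parameterized query, never string concatenation
        )
        return [row[0] for row in rows]


class CsvSource:
    """File-based source, such as an extraction from a legacy system."""

    def __init__(self, name, handle):
        self.name = name
        self.rows = list(csv.DictReader(handle))

    def linked_records(self, person_id):
        return [r["claim_id"] for r in self.rows if r["person_id"] == person_id]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE contracts (contract_id TEXT, person_id TEXT)")
db.executemany("INSERT INTO contracts VALUES (?, ?)", [("C1", "P002"), ("C2", "P001")])
extract = io.StringIO("claim_id,person_id\nK7,P002\nK8,P003\n")

sources = [SqlSource("management", db), CsvSource("legacy", extract)]
# Resolve which records reference the duplicate person in each source.
linked = {s.name: s.linked_records("P002") for s in sources}
```

With this kind of interface, adding a REST-based source would mean writing one more class with the same `linked_records` method, leaving the merge logic unchanged.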
Is the process GDPR-compliant?
Yes. Deduplication contributes to GDPR compliance by reducing redundant personal data. Processing runs on the client's own infrastructure (Azure), so data never leaves the mutual's perimeter. A record of processing activities is maintained in accordance with Article 30 of the GDPR.
Can the process be run regularly?
Yes. With each new listing of identified duplicates, the script can be rerun to propagate merges across all systems. The process is designed to be executed on a recurring basis as new duplicates are identified.
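Recurring runs are easier to operate when a listing that overlaps an earlier one does not repeat merges already applied. The sketch below shows one way to achieve that; the `run_listing` function and the in-memory `history` set are assumptions for illustration, and in practice the history would need to be persisted, for example derived from the merge log.

```python
def run_listing(pairs, apply_merge, already_merged):
    """Apply each (source_id, target_id) pair once; skip sources merged earlier."""
    applied, skipped = [], []
    for source_id, target_id in pairs:
        if source_id in already_merged:
            skipped.append(source_id)  # merged in a previous run
            continue
        apply_merge(source_id, target_id)
        already_merged.add(source_id)
        applied.append(source_id)
    return applied, skipped


history = set()  # hypothetical store of source records already merged
calls = []       # records what the merge callback was asked to do


def record_merge(source_id, target_id):
    calls.append((source_id, target_id))


first = run_listing([("P002", "P001")], record_merge, history)
second = run_listing([("P002", "P001"), ("P005", "P004")], record_merge, history)
```

In the second run only the new pair is applied, while the repeated pair is reported as skipped.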
How long does it take to set up the system?
The project was completed in 3 months: 1 month analyzing sources and business rules, 1 month developing the multisource merge script, and 1 month of testing, adjustments, and production deployment. The initial cleanup of the existing database was carried out during the acceptance testing phase.
Have you identified a need?
Free 30-minute diagnostic — no commitment, confidential.