Case study

DédupData

Automated Python script for person deduplication, connecting all data sources of a national health insurance mutual.

Sector: Insurance / Mutual insurer · Data quality · Python · 3 months · Anonymized reference
The client

National mutual insurance company

A national-scale health insurance mutual with several hundred thousand members, managing data flows from multiple systems: DMS, Cegedim management software, Zendesk support platform, and Azure databases. Over the years, thousands of duplicate person records had accumulated in the information system, created through human error by the management teams.

5+ data sources · 100K+ records processed · Python merge script · Azure cloud infrastructure
From thousands of duplicates to a unified database

Management teams were inadvertently creating duplicate person records in the IS. These duplicates had been accumulating for years, scattered across multiple systems with no automated merging capability.

What we observed


Duplicates created by human error

Operators created new records instead of finding existing ones, driven by name variations, typos, and inconsistent address formats.


Siloed data

DMS, Cegedim, Zendesk, Azure database: each system had its own copy of person records, with no synchronization or reconciliation.


No single person view

It was impossible to reconstruct a member's complete history. Information was fragmented across multiple records.


GDPR risk

Duplicates multiplied personal data stored without legitimate purpose, increasing exposure to GDPR non-compliance risk.

Our solution

An automated cross-system merge process

Duplicate listing as input

The process receives as input a listing of persons identified as duplicates. Identification is performed upstream by business teams or the mutual's internal tools.
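As a sketch, the ingestion step could validate a CSV listing like the one below. The column names `duplicate_id` and `master_id` are assumptions for illustration, not the mutual's actual file format:

```python
import csv
import io

def load_duplicate_listing(csv_text):
    """Parse a duplicate listing: each row merges 'duplicate_id'
    into 'master_id' (column names are illustrative)."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        dup = row["duplicate_id"].strip()
        master = row["master_id"].strip()
        # Reject empty identifiers and self-merges up front
        if not dup or not master or dup == master:
            raise ValueError(f"invalid row: {row}")
        pairs.append((dup, master))
    return pairs

listing = "duplicate_id,master_id\nP-1042,P-0007\nP-2210,P-0815\n"
pairs = load_duplicate_listing(listing)
```

Validating the listing before touching any system keeps bad input from propagating a wrong merge across five sources.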

Multisource connection

A Python script connects simultaneously to the DMS, Cegedim, Zendesk, Azure SQL Database, and the rest of the mutual's databases.

Automated cross-system merging

For each identified duplicate, the script automatically propagates the merge across all connected systems: updating references, consolidating histories, and removing redundant records.
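A minimal in-memory sketch of what one per-system merge involves (consolidating fields, repointing references, deleting the redundant record); real connectors would do the same through each system's API or database:

```python
class SystemConnector:
    """In-memory stand-in for one source system; a real connector
    would wrap a REST API or a SQL database."""

    def __init__(self, name):
        self.name = name
        self.records = {}      # person_id -> field dict
        self.references = {}   # ref_id -> person_id (tickets, documents...)

    def merge(self, duplicate_id, master_id):
        """Fold the duplicate into the master: consolidate fields,
        repoint references, remove the redundant record."""
        dup = self.records.pop(duplicate_id, None)
        if dup is None:
            return 0  # this system holds no copy of the duplicate
        master = self.records.setdefault(master_id, {})
        for field, value in dup.items():
            master.setdefault(field, value)  # master's value wins on conflict
        repointed = 0
        for ref_id, person_id in self.references.items():
            if person_id == duplicate_id:
                self.references[ref_id] = master_id
                repointed += 1
        return repointed

zendesk = SystemConnector("Zendesk")
zendesk.records = {"P-1": {"email": "old@example.org"}, "P-2": {"name": "A. Martin"}}
zendesk.references = {"ticket-41": "P-1", "ticket-42": "P-2"}
updated = zendesk.merge("P-1", "P-2")
```

The conflict rule shown here (the master record's value wins) is one possible policy; in practice it would follow the client's business rules.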

Complete traceability

Each merge operation is logged with a detailed report: source records, target record, impacted systems, and consolidated data. A complete history for audit and compliance.
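The audit trail can be sketched as JSON-lines entries, one per completed merge. The field names here are illustrative, not the client's actual schema:

```python
import datetime
import json

def log_merge(audit_log, source_ids, target_id, impacted_systems):
    """Append one JSON-lines audit entry per completed merge."""
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "source_records": source_ids,
        "target_record": target_id,
        "impacted_systems": impacted_systems,
    }
    # One JSON object per line: easy to append, grep, and replay for audits
    audit_log.append(json.dumps(entry))
    return entry

audit_log = []
entry = log_merge(audit_log, ["P-1042"], "P-0007", ["DMS", "Zendesk"])
```

An append-only log like this is what makes the merge history reviewable after the fact for audit and GDPR purposes.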

-85% duplicates eliminated · 3 months from scoping to production · 5 connected sources · 100% automated
From 5 data sources to a unified database

From a listing of duplicates identified upstream, the Python script connects to all systems and propagates the merge to produce a clean, reconciled database.

[Architecture diagram]
Sources: DMS (documents) · Cegedim (health management) · Zendesk (customer support) · Azure SQL (database) · other internal sources
↓
Merge Engine (Python): duplicate listing as input · multisource connection · automated merging · traceability
↓
Outputs: unified database (single person view) · merge reports (merged records, history) · audit logs (complete traceability)
How the merge works

From a listing of duplicates identified upstream, the process automatically propagates the merge across all connected systems.

Automated pipeline

From duplicate listing to the unified database

The process receives as input a listing of persons identified as duplicates by business teams. It then connects to all of the mutual's systems to propagate the merge: updating references, consolidating histories, and removing redundant records.

  • Listing ingestion — reading the duplicate file identified upstream, validating the format and identifiers
  • Multisource connection — simultaneous access to the DMS, Cegedim, Zendesk, Azure SQL, and all of the mutual's databases
  • Cross-system merging — for each duplicate, propagation of the merge across all systems: data consolidation, cross-reference updates
  • Traceability and reports — detailed log of each completed merge, summary report for audit and compliance
1. Listing reception (Intake): ingestion of the file of duplicates identified upstream
2. System connection (Auto): simultaneous access to the 5 data sources
3. Reference resolution (ETL): identification of linked records in each system
4. Cross-system merging (Auto): data consolidation, removal of redundant records
5. Report and traceability (Audit): log of each merge, summary report
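The pipeline above can be sketched as a small orchestration loop. The connector class here is a toy in-memory stand-in for the real DMS, Cegedim, Zendesk, and Azure SQL connectors:

```python
def run_merge_pipeline(pairs, connectors):
    """For each (duplicate, master) pair, propagate the merge to every
    connected system and build a per-pair report."""
    report = []
    for dup, master in pairs:
        impacted = []
        for conn in connectors:
            if conn.merge(dup, master):  # True if this system held the duplicate
                impacted.append(conn.name)
        report.append({"source": dup, "target": master, "systems": impacted})
    return report

class InMemorySystem:
    """Toy connector standing in for a real source system."""
    def __init__(self, name, people):
        self.name, self.people = name, set(people)

    def merge(self, dup, master):
        if dup not in self.people:
            return False
        self.people.discard(dup)
        self.people.add(master)
        return True

systems = [InMemorySystem("DMS", {"P-1"}), InMemorySystem("Zendesk", {"P-1", "P-2"})]
report = run_merge_pipeline([("P-1", "P-2")], systems)
```

The per-pair report is what feeds the summary report and audit log in the final step.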
Concrete return on investment

Before/after comparison of data quality at a national mutual insurance company.

What changes with automated deduplication

Indicator | Before | With BiDev
Duplicates in the database | Thousands of undetected duplicates | -85% of duplicates eliminated
Manual reconciliation | 2-3 days/month per team | Automatic and continuous
Single customer view | Non-existent | Unified cross-source database
GDPR risk | High (duplicated data) | Compliant by design
New duplicate detection | After the fact, random | Real-time, automatic alerts

What our clients ask us

How does the deduplication process work?

Duplicate identification is performed upstream by business teams or the mutual's internal tools. Our process receives this listing of duplicate persons as input, then connects to all systems (DMS, Cegedim, Zendesk, Azure SQL) to automatically propagate the merge across each data source.

What data sources can you connect?

The Python script connects to any source exposing an API, a SQL database, files (CSV, Excel, XML), or a standard protocol. For this mutual: DMS (REST API), Cegedim (SQL database), Zendesk (REST API), Azure SQL Database, and file extractions from legacy systems.
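One way to put heterogeneous sources behind a uniform interface is to give every connector the same lookup method; below is a sketch for a SQL-backed source using an in-memory SQLite database (the table and column names are assumptions, and a REST-backed source would expose the same method over an HTTP client):

```python
import sqlite3

class SqlPersonSource:
    """SQL-backed source behind a uniform fetch_person interface."""

    def __init__(self, conn):
        self.conn = conn

    def fetch_person(self, person_id):
        # Return the person as a plain dict, or None if absent
        row = self.conn.execute(
            "SELECT id, name, email FROM persons WHERE id = ?", (person_id,)
        ).fetchone()
        return dict(zip(("id", "name", "email"), row)) if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (id TEXT PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO persons VALUES ('P-7', 'A. Martin', 'a.martin@example.org')")
source = SqlPersonSource(conn)
```

With every source normalized to the same interface, the merge engine never needs to know whether a record came from an API, a database, or a file extract.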

Is the process GDPR-compliant?

Yes. Deduplication contributes to GDPR compliance by reducing redundant personal data. Processing is carried out on the client's infrastructure (Azure), and data never leaves the mutual's perimeter. A processing register is maintained in accordance with Article 30 of the GDPR.

Can the process be run regularly?

Yes. With each new listing of identified duplicates, the script can be rerun to propagate merges across all systems. The process is designed to be executed on a recurring basis as new duplicates are identified.

How long does it take to set up the system?

The project was completed in 3 months: 1 month analyzing sources and business rules, 1 month developing the multisource merge script, and 1 month of testing, adjustments, and production deployment. The initial cleanup of the existing database was carried out during the acceptance testing phase.

Have you identified a need?

Free 30-minute diagnostic — no commitment, confidential.