How does the deduplication process work?

Duplicate identification is performed upstream by the business teams. Our process receives this listing as input, then automatically propagates the merge across all connected systems.

What data sources can you connect?

The Python engine connects to any source exposing an API, a SQL database, CSV/Excel/XML files, or a standard protocol. For this mutual: DMS, Cegedim, Zendesk, Azure SQL Database.

Is the process GDPR-compliant?

Yes. Deduplication contributes to GDPR compliance by reducing redundant personal data. Processing is carried out on the client's infrastructure, and the data never leaves that perimeter.

Can the process be rerun regularly?

Yes. With each new listing of identified duplicates, the script can be rerun to propagate the merges across all systems.

How long does it take to set up the system?

The project was completed in 3 months: 1 month analyzing the sources and business rules, 1 month developing the multisource merge script, and 1 month of testing and production deployment.

Case study

DédupData

Automated Python script for person deduplication, connecting all data sources of a national health insurance mutual.

Insurance / Mutual insurer · Data quality · Python · 3 months · Anonymized reference

The client

National mutual insurance company

A national-scale health insurance mutual with several hundred thousand members, managing data flows from multiple systems: DMS, Cegedim management software, Zendesk support platform, and Azure databases. Over the years, thousands of duplicate person records had accumulated in the information system, created through human error by the management teams.

5 data sources

100K+ records processed

Python merge script

Azure cloud infrastructure

Field observations

From thousands of duplicates to a unified database

Management teams were inadvertently creating duplicate person records in the IS. These duplicates had been accumulating for years, scattered across multiple systems with no automated merging capability.

What we observed

Duplicates created by human error

Operators created new records instead of finding existing ones, which introduced name variations, typos, and inconsistent address formats.

Siloed data

DMS, Cegedim, Zendesk, Azure database: each system had its own copy of person records, with no synchronization or reconciliation.

No single person view

It was impossible to reconstruct a member's complete history. Information was fragmented across multiple records.

GDPR risk

Duplicates multiplied the personal data stored without a legitimate purpose, increasing the risk of GDPR non-compliance.

Our solution

An automated cross-system merge process

Duplicate listing as input

The process receives as input a listing of persons identified as duplicates. Identification is performed upstream by business teams or the mutual's internal tools.
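As a minimal sketch of this input step, assume the listing arrives as a CSV file with two hypothetical columns, `duplicate_id` and `target_id` (the actual file format is not described here). Ingestion and validation could then look like this:

```python
import csv
import io


def load_listing(text):
    """Parse a duplicate listing and return (valid_pairs, errors).

    Assumed format (hypothetical): CSV with `duplicate_id` and `target_id`
    columns, one duplicate-to-surviving-record pair per row.
    """
    reader = csv.DictReader(io.StringIO(text))
    required = {"duplicate_id", "target_id"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    pairs, errors, seen = [], [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        dup = (row["duplicate_id"] or "").strip()
        target = (row["target_id"] or "").strip()
        if not dup or not target:
            errors.append((line_no, "empty identifier"))
        elif dup == target:
            errors.append((line_no, "record cannot be merged into itself"))
        elif dup in seen:
            errors.append((line_no, "duplicate listed more than once"))
        else:
            seen.add(dup)
            pairs.append((dup, target))
    return pairs, errors


sample = "duplicate_id,target_id\nP002,P001\nP003,P003\n,P001\nP002,P004\n"
pairs, errors = load_listing(sample)
```

Rejected rows are collected rather than raised, so a single bad line does not block the rest of the listing and the errors can be sent back to the business teams.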

Multisource connection

A Python script connects simultaneously to the DMS, Cegedim, Zendesk, Azure SQL Database, and all of the mutual's other databases.

Automated cross-system merging

For each identified duplicate, the script automatically propagates the merge across all connected systems: updating references, consolidating histories, and removing redundant records.
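One way to picture the propagation is a common interface implemented once per system. The sketch below is illustrative only: `SystemAdapter`, `InMemorySystem`, and `propagate` are hypothetical names, and the in-memory dictionaries stand in for the real DMS, Cegedim, Zendesk, and SQL connections, which are not shown.

```python
from typing import Protocol


class SystemAdapter(Protocol):
    """Hypothetical interface: one implementation per connected system."""

    name: str

    def merge(self, duplicate_id: str, target_id: str) -> int: ...


class InMemorySystem:
    """Stand-in for a real system: maps record id -> owning person id."""

    def __init__(self, name, records):
        self.name = name
        self.records = dict(records)

    def merge(self, duplicate_id, target_id):
        # Re-point every record owned by the duplicate to the target person.
        moved = [rid for rid, owner in self.records.items() if owner == duplicate_id]
        for rid in moved:
            self.records[rid] = target_id
        return len(moved)


def propagate(pairs, systems):
    """Apply each (duplicate, target) pair to every system; return counts."""
    summary = []
    for duplicate_id, target_id in pairs:
        for system in systems:
            moved = system.merge(duplicate_id, target_id)
            summary.append((duplicate_id, target_id, system.name, moved))
    return summary


systems = [
    InMemorySystem("dms", {"doc-1": "P001", "doc-2": "P002"}),
    InMemorySystem("support", {"ticket-9": "P002", "ticket-10": "P002"}),
]
summary = propagate([("P002", "P001")], systems)
```

Keeping the per-system logic behind one `merge` method means a new source can be added without touching the propagation loop.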

Complete traceability

Each merge operation is logged with a detailed report: source records, target record, impacted systems, and consolidated data. This provides a complete history for audit and compliance.
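A log entry carrying those four elements could be serialized as one JSON line per merge. This is only a sketch: the `MergeLogEntry` class, its field names, and the JSON Lines format are assumptions, not the project's actual report format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class MergeLogEntry:
    """Hypothetical audit record for a single merge operation."""

    source_ids: list
    target_id: str
    impacted_systems: list
    consolidated_fields: dict
    # UTC timestamp set when the entry is created.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json_line(entry):
    """Serialize an entry as a single JSON line, suitable for appending to a log."""
    return json.dumps(asdict(entry), sort_keys=True, ensure_ascii=False)


entry = MergeLogEntry(
    source_ids=["P002"],
    target_id="P001",
    impacted_systems=["dms", "support"],
    consolidated_fields={"email": "kept from P002"},
)
line = to_json_line(entry)
```

An append-only file of such lines is easy to archive and to filter by person identifier during an audit.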

85% of duplicates eliminated

3 months from scoping to production

5 connected sources

100% automated

Architecture

From 5 data sources to a unified database

From a listing of duplicates identified upstream, the Python script connects to all systems and propagates the merge to produce a clean, reconciled database.

Sources:
DMS: documents
Cegedim: health management
Zendesk: customer support
Azure SQL: database
Other: internal sources

▼

Merge Engine (Python): duplicate listing as input · multisource connection · automated merging · traceability

▼

Outputs:
Unified database: single person view
Merge reports: merged records, history
Audit logs: complete traceability

Python · Azure · Zendesk · Cegedim

Process

How the merge works

From a listing of duplicates identified upstream, the process automatically propagates the merge across all connected systems.

Automated pipeline

From duplicate listing to the unified database

The process receives as input a listing of persons identified as duplicates by business teams. It then connects to all of the mutual's systems to propagate the merge: updating references, consolidating histories, and removing redundant records.

Listing ingestion — reading the duplicate file identified upstream, validating the format and identifiers
Multisource connection — simultaneous access to the DMS, Cegedim, Zendesk, Azure SQL, and all of the mutual's databases
Cross-system merging — for each duplicate, propagation of the merge across all systems: data consolidation, cross-reference updates
Traceability and reports — detailed log of each completed merge, summary report for audit and compliance

1. Listing reception (Intake): ingestion of the duplicate file identified upstream
2. System connection (Auto): simultaneous access to 5 data sources
3. Reference resolution (ETL): identification of linked records in each system
4. Cross-system merging (Auto): data consolidation, removal of redundant records
5. Report and traceability (Audit): log of each merge, summary report
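The reference resolution step can be sketched as a lookup that lists, for one person, the linked records held by each system before anything is modified. The record layout (`id` and `person_id` keys) and the system names below are hypothetical.

```python
from collections import defaultdict


def resolve_references(person_id, systems):
    """Return {system_name: [ids of records linked to person_id]}.

    `systems` maps a system name to a list of record dicts; the `id` and
    `person_id` keys are assumed for illustration.
    """
    linked = defaultdict(list)
    for system_name, records in systems.items():
        for record in records:
            if record["person_id"] == person_id:
                linked[system_name].append(record["id"])
    return dict(linked)


systems = {
    "dms": [
        {"id": "doc-1", "person_id": "P001"},
        {"id": "doc-2", "person_id": "P002"},
    ],
    "support": [{"id": "ticket-9", "person_id": "P002"}],
    "sql": [],
}
linked = resolve_references("P002", systems)
```

Systems that hold nothing for the person are simply absent from the result, which gives the list of impacted systems for the merge report.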

Measurable impact

Concrete return on investment

Before/after comparison of data quality at a national mutual insurance company.

What changes with automated deduplication

| Indicator | Before | With BiDev |
| --- | --- | --- |
| Duplicates in the database | Thousands of undetected duplicates | 85% of duplicates eliminated |
| Manual reconciliation | 2-3 days / month per team | Automatic and continuous |
| Single customer view | Non-existent | Unified cross-source database |
| GDPR risk | High (duplicated data) | Compliant by design |
| New duplicate detection | After the fact, random | Real-time, automatic alert |

Frequently Asked Questions

What our clients ask us

How does the deduplication process work?

Duplicate identification is performed upstream by business teams or the mutual's internal tools. Our process receives this listing of duplicate persons as input, then connects to all systems (DMS, Cegedim, Zendesk, Azure SQL) to automatically propagate the merge across each data source.

What data sources can you connect?

The Python script connects to any source exposing an API, a SQL database, files (CSV, Excel, XML), or a standard protocol. For this mutual: DMS (REST API), Cegedim (SQL database), Zendesk (REST API), Azure SQL Database, and file extractions from legacy systems.
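To illustrate how heterogeneous sources can be read into one common shape, the sketch below returns plain dictionaries from both a SQL query and a CSV extract. It uses the standard library's `sqlite3` as a stand-in for a SQL database: connecting to Azure SQL or Cegedim would need a dedicated driver and credentials that are not shown here, and the table and column names are invented for the example.

```python
import csv
import io
import sqlite3


def read_sql(conn, query):
    """Yield the rows of a query as dicts, using the generic DB-API cursor."""
    cur = conn.cursor()
    cur.execute(query)
    columns = [col[0] for col in cur.description]
    for row in cur:
        yield dict(zip(columns, row))


def read_csv(text):
    """Yield the rows of a CSV extract as dicts keyed by the header line."""
    yield from csv.DictReader(io.StringIO(text))


# In-memory SQLite database standing in for a real SQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id TEXT, last_name TEXT)")
conn.execute("INSERT INTO person VALUES ('P001', 'Martin')")

sql_rows = list(read_sql(conn, "SELECT person_id, last_name FROM person"))
csv_rows = list(read_csv("person_id,last_name\nP002,Martin\n"))
conn.close()
```

Once every source yields the same dictionary shape, the merge logic does not need to know where a record came from.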

Is the process GDPR-compliant?

Yes. Deduplication contributes to GDPR compliance by reducing redundant personal data. Processing is carried out on the client's infrastructure (Azure), and data never leaves the mutual's perimeter. A processing register is maintained in accordance with Article 30 of the GDPR.

Can the process be run regularly?

Yes. With each new listing of identified duplicates, the script can be rerun to propagate merges across all systems. The process is designed to be executed on a recurring basis as new duplicates are identified.
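Recurring runs raise one practical question: a new listing may repeat a pair that was already merged, or point to a record that has itself been merged since. The sketch below shows one possible way to handle this, assuming the script keeps a history of past merges; `merged_into`, `final_target`, and `plan_run` are hypothetical names.

```python
def final_target(person_id, merged_into):
    """Follow earlier merges until reaching a record that still exists."""
    seen = set()
    while person_id in merged_into:
        if person_id in seen:
            raise ValueError(f"merge cycle involving {person_id}")
        seen.add(person_id)
        person_id = merged_into[person_id]
    return person_id


def plan_run(pairs, merged_into):
    """Return the merges still to apply and record them in `merged_into`.

    `merged_into` maps an already merged id to the id that replaced it.
    """
    todo = []
    for duplicate_id, target_id in pairs:
        if duplicate_id in merged_into:
            continue  # already merged in a previous run
        target = final_target(target_id, merged_into)
        if target != duplicate_id:
            todo.append((duplicate_id, target))
            merged_into[duplicate_id] = target
    return todo


history = {"P002": "P001"}  # result of an earlier run
new_listing = [("P002", "P001"), ("P003", "P002")]
todo = plan_run(new_listing, history)
```

Here the repeated pair is skipped and `P003` is redirected to `P001`, the record that `P002` was merged into earlier, so running the same listing twice produces no further changes.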

How long does it take to set up the system?

The project was completed in 3 months: 1 month analyzing sources and business rules, 1 month developing the multisource merge script, and 1 month of testing, adjustments, and production deployment. The initial cleanup of the existing database was carried out during the acceptance testing phase.
