Deduplicating Indici data

Background[edit | edit source]

Indici shares delta data on a daily basis in parquet format. This contains a large number of apparent duplicates.

While all incoming data is just stashed as-is in the indici_staging schema, a separate process is required to deduplicate so it can be reliably used for reporting.

This page is meant to detail exactly how that process should be built/executed.

Deduplicating staging data[edit | edit source]

Catching up already stored data in staging schema[edit | edit source]

The broad principle is:

Order the existing indici_staging data by primary key and filesourcekey values
Insert into temp table
Get the most recent record (via filesourcekey for each primary key)
Load dataset into target table in rpt schema.

Item 1 can be achieved using example syntax:

create temp table tmp_quickconsult as
select
q.*,
row_number() OVER (PARTITION BY q."QuickConsultKeyID" ORDER BY q.filesourcekey DESC) AS r
from indici_staging.quickconsult q
;

This can then be written to target table using only r=1, and then dropping that column in target. Temp table can then be dropped also.

Automating deduplication in `rpt` schema target[edit | edit source]

For the first/earliest load you can obviously just insert straight into your target table, as described above.

Ongoing subsequent runs for new incoming data obviously need to be deduplicated. This is done via the following syntax:

insert into rpt.quickconsult --your new target table
select tq.*
from dbt.tmp_quickconsult tq
on conflict("QuickConsultKeyID")
do update set -- set every single column :/

"Column1" = EXCLUDED."Column1",
"Column2" = EXCLUDED."Column2",
"Column3" = EXCLUDED."Column3"

This code will:

INSERT into target table if primary key doesn't already exist
UPDATE the record with latest data if the primary key is found to already be in the table.

More or less this is an 'UPSERT' process.

Above code should be built as a database function, so it can be easily called on demand. This function call is then executed via a db cron job - this is described in Indici data.

References[edit | edit source]

Deduplicating Indici data

Contents

Background[edit | edit source]

Deduplicating staging data[edit | edit source]

Catching up already stored data in staging schema[edit | edit source]

Automating deduplication in `rpt` schema target[edit | edit source]

References[edit | edit source]

Navigation menu

Deduplicating Indici data

Background[edit | edit source]

Deduplicating staging data[edit | edit source]

Catching up already stored data in staging schema[edit | edit source]

Automating deduplication in rpt schema target[edit | edit source]

References[edit | edit source]

Navigation menu

Search

Automating deduplication in `rpt` schema target[edit | edit source]