Deduplicating Indici data

From Kautepedia
(Redirected from Deduplicating Indici data,)
Jump to navigation Jump to search

Background[edit | edit source]

Indici shares delta data on a daily basis in parquet format. This contains a large number of apparent duplicates.

While all incoming data is just stashed as-is in the indici_staging schema, a separate process is required to deduplicate so it can be reliably used for reporting.

This page is meant to detail exactly how that process should be built/executed.

Deduplicating staging data[edit | edit source]

Catching up already stored data in staging schema[edit | edit source]

The broad principle is:

  1. Order the existing indici_staging data by primary key and filesourcekey values
  2. Insert into temp table
  3. Get the most recent record (via filesourcekey for each primary key)
  4. Load dataset into target table in rpt schema.

Item 1 can be achieved using example syntax:

create temp table tmp_quickconsult as
select
q.*,
row_number() OVER (PARTITION BY q."QuickConsultKeyID" ORDER BY q.filesourcekey DESC) AS r
from indici_staging.quickconsult q
;

This can then be written to target table using only r=1, and then dropping that column in target. Temp table can then be dropped also.


Automating deduplication in rpt schema target[edit | edit source]

For the first/earliest load you can obviously just insert straight into your target table, as described above.

Ongoing subsequent runs for new incoming data obviously need to be deduplicated. This is done via the following syntax:

insert into rpt.quickconsult --your new target table
select tq.*
from dbt.tmp_quickconsult tq
on conflict("QuickConsultKeyID")
do update set -- set every single column :/

"Column1" = EXCLUDED."Column1",
"Column2" = EXCLUDED."Column2",
"Column3" = EXCLUDED."Column3"

This code will:

  • INSERT into target table if primary key doesn't already exist
  • UPDATE the record with latest data if the primary key is found to already be in the table.

More or less this is an 'UPSERT' process.

Above code should be built as a database function, so it can be easily called on demand. This function call is then executed via a db cron job - this is described in Indici data.

References[edit | edit source]