==Deduplicating staging data==
===Catching up already stored data in staging schema===
The broad principle is:
# Order the existing <code>indici_staging</code> data by primary key and <code>filesourcekey</code> values
# Insert into a temp table
# Get the most recent record for each primary key (via <code>filesourcekey</code>)
# Load the dataset into the target table in the <code>rpt</code> schema.

Steps 1 and 2 can be achieved using the following example syntax, which also numbers the rows within each primary key so that step 3 becomes a simple filter:
<syntaxhighlight lang="sql" line>
create temp table tmp_quickconsult as
select
    q.*,
    -- r = 1 marks the most recent row for each primary key
    row_number() over (
        partition by q."QuickConsultKeyID"
        order by q.filesourcekey desc
    ) as r
from indici_staging.quickconsult q;
</syntaxhighlight>
Only the rows with <code>r=1</code> are then written to the target table, after which the <code>r</code> column is dropped from the target. The temp table can then be dropped as well.

====Automating deduplication in <code>rpt</code> schema target====
For the first (earliest) load you can insert straight into your target table, as described above. Subsequent runs for new incoming data need to be deduplicated against the rows already in the target. This is done via the following syntax:
<syntaxhighlight lang="sql" line>
insert into rpt.quickconsult -- your new target table
select tq.*
from dbt.tmp_quickconsult tq
on conflict ("QuickConsultKeyID") do update
set
    -- every column has to be listed explicitly
    "Column1" = EXCLUDED."Column1",
    "Column2" = EXCLUDED."Column2",
    "Column3" = EXCLUDED."Column3";
</syntaxhighlight>
This code will:
* INSERT into the target table if the primary key doesn't already exist
* UPDATE the record with the latest data if the primary key is already in the table.

In effect this is an 'UPSERT' process. Note that <code>ON CONFLICT</code> relies on a primary key or unique constraint on <code>"QuickConsultKeyID"</code> in the target table, and that the incoming rows must already hold at most one row per key, otherwise PostgreSQL rejects the statement.

The above code should be built as a database function, so it can be easily called on demand. This function call is then executed via a database cron job; this is described in [[Indici data]].
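As a minimal sketch of the function wrapper mentioned above: the function name <code>rpt.upsert_quickconsult</code> is hypothetical and the column list is the same placeholder as in the example, so both need adapting to the real table. It assumes <code>dbt.tmp_quickconsult</code> has already been reduced to one row per key.
<syntaxhighlight lang="sql" line>
-- Hypothetical function name; adjust schema and name to local conventions.
create or replace function rpt.upsert_quickconsult()
returns void
language plpgsql
as $$
begin
    -- Same upsert as above, wrapped so it can be called on demand.
    insert into rpt.quickconsult
    select tq.*
    from dbt.tmp_quickconsult tq
    on conflict ("QuickConsultKeyID") do update
    set
        -- placeholder columns; list every real column here
        "Column1" = EXCLUDED."Column1",
        "Column2" = EXCLUDED."Column2",
        "Column3" = EXCLUDED."Column3";
end;
$$;

-- Example call, e.g. from the scheduled job:
select rpt.upsert_quickconsult();
</syntaxhighlight>
Keeping the statement inside a single function means the scheduled job only needs to know the function name, and changes to the column list stay in one place.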
====Finding and Deleting Duplicates====
To ensure only the most recent record for each primary key is retained, duplicates can be identified and removed using the following steps:

'''1. Identify duplicates''': Use the <code>ROW_NUMBER()</code> function to assign a rank to each row for the same primary key, ordered by <code>kptinsertedat</code> or <code>filesourcekey</code>. Rows with <code>r > 1</code> are the older copies. For example:
<syntaxhighlight lang="sql" line>
WITH ranked_rows AS (
    SELECT
        "ACC45ID",
        filesourcekey,
        kptinsertedat,
        -- r = 1 is the most recent row for each "ACC45ID"
        ROW_NUMBER() OVER (
            PARTITION BY "ACC45ID"
            ORDER BY kptinsertedat DESC
        ) AS r
    FROM indici_staging.accidents
)
SELECT *
FROM ranked_rows
WHERE r > 1;
</syntaxhighlight>

'''2. Delete duplicates''': Once duplicates are identified, rows where <code>r > 1</code> can be deleted while retaining the most recent record:
<syntaxhighlight lang="sql" line>
WITH ranked_rows AS (
    SELECT
        "ACC45ID",
        filesourcekey,
        ROW_NUMBER() OVER (
            PARTITION BY "ACC45ID"
            ORDER BY kptinsertedat DESC
        ) AS r
    FROM indici_staging.accidents
)
DELETE FROM indici_staging.accidents
WHERE ("ACC45ID", filesourcekey) IN (
    -- every older copy, identified by key and source file
    SELECT "ACC45ID", filesourcekey
    FROM ranked_rows
    WHERE r > 1
);
</syntaxhighlight>
This leaves only the most recent record for each primary key in the staging table, ready to be inserted into the target table. The delete matches rows on the pair <code>("ACC45ID", filesourcekey)</code>, so it assumes that pair is unique: if two rows share both values, all of them are deleted, including the one ranked <code>r = 1</code>.
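After the delete, a quick check along the following lines can confirm that no primary key still has more than one row. This is only a sketch against the same example table; an empty result means the staging table is fully deduplicated.
<syntaxhighlight lang="sql" line>
-- Lists any "ACC45ID" that still appears more than once.
SELECT
    "ACC45ID",
    COUNT(*) AS copies
FROM indici_staging.accidents
GROUP BY "ACC45ID"
HAVING COUNT(*) > 1;
</syntaxhighlight>
Running the same query before the delete gives a count of affected keys, which is a useful sanity check on how many rows the delete is expected to remove.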