Update Indici data load
==Background==
By default, we load only the fields from [[Indici]] data that we absolutely need. Occasionally it turns out that a particular job needs a field we are not loading. When that happens, follow the steps below to add the missing fields cleanly.

==Identify data points that are required==
All target DB tables accommodate all fields, but not all of them are populated by our load job. Required data can be identified by:
* Picking up a parquet file with recent transactions, as recorded in <code>indici_staging.auditlog</code>.
* Eyeballing the file contents. One way to do this is a Python script along the following lines:

<source lang="python">
import pandas as pd

pd.options.display.max_columns = None  # show all columns

df = pd.read_parquet('your_file.parquet', engine='pyarrow')
print(df.head(450))
</source>

Note that some files have a lot of columns, and pandas will truncate the displayed output unless <code>max_columns</code> is set as above. Also note that <code>fastparquet</code> can be used as the engine instead of <code>pyarrow</code>.

==Prepare database==
Next, truncate your target tables and remove the associated records in <code>indici_staging.auditlog</code>.<ref>Also be aware that some datatypes may not be set correctly for your new data. In this scenario, the target table will have to be recreated.</ref>

===Truncate target tables===
<source lang="sql">
truncate table indici_staging.target_table;
truncate table rpt.target_table;
</source>

===Remove auditlog entries===
<source lang="sql">
delete from indici_staging.auditlog where "table" = 'target_table';
</source>

==Update SQL definition==
Next, change the SQL definition stored in the DynamoDB table <code>indiciLoadSQL</code>, which is used by the Lambda insert function.
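
Before editing the definition, it can help to list exactly which columns in the incoming file are not yet part of the load. The following is a minimal sketch only: the function name and the sample column names are hypothetical, and the case-insensitive comparison is an assumption about how column names are stored in the target tables.

```python
def missing_columns(file_columns, loaded_columns):
    """Return columns present in the file but absent from the current load.

    Names are compared case-insensitively (an assumption); the result keeps
    the order and spelling used in the file.
    """
    loaded = {name.lower() for name in loaded_columns}
    return [name for name in file_columns if name.lower() not in loaded]


# Hypothetical column names, for illustration only
file_columns = ["InvoiceID", "PatientID", "TotalAmount", "Notes"]
loaded_columns = ["invoiceid", "patientid"]

print(missing_columns(file_columns, loaded_columns))
```

In practice <code>file_columns</code> could be taken from <code>list(df.columns)</code> after reading the parquet file, and <code>loaded_columns</code> from the fields already named in the existing definition.
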
The partition key equates to the target table name and has three values associated with it:
* <code>df</code> = DataFrame definition for subsetting data from the incoming file
* <code>sql</code> = SQL insert into <code>indici_staging</code>
* <code>sql_dd</code> = SQL for deduplication and insert/merge into <code>rpt</code> (used by Step Functions)

You must therefore update both the <b>df</b> and <b>sql</b> values to add in any new fields. Save your changes in DynamoDB. The Lambda will use the new definition on its next run.

==Backload existing data==
To reprocess existing files, use the bulk reupload Lambda. This Lambda:
* Loads objects from <code>kpa-valentia</code> matching a prefix (e.g. <code>Invoices_</code>)
* Re-uploads them to the same bucket using <code>PutObject</code> logic via <code>copy_from</code>
* Thereby triggers the Step Function via EventBridge (which listens for S3 <code>PutObject</code> events on <code>kpa-valentia</code>)

===Steps===
# Set your prefix (e.g. <code>Invoices_</code>) at the top of the Lambda
# Run the Lambda. It will:
## Match all files starting with that prefix and ending in <code>.parquet</code>
## Re-upload each file onto itself using <code>copy_from()</code>
## Trigger the Step Function, which will:
### Call <code>copyIndiciFiles</code>
### Load data into <code>indici_staging</code>
### Call <code>rptDeduplication</code> to populate <code>rpt</code>

===Monitoring===
You can monitor execution through:
* CloudWatch logs for the Lambda
* Step Function execution history (each file triggers an execution)
* Checking row counts in <code>indici_staging.your_table</code> and <code>rpt.your_table</code>

==References==
<references />

[[Category:Indici]]