Please find below the use cases we are looking to implement for Data Scrubbing and Synthetic Data Generation.
Data Scrubbing
a. Read data files from the production S3 bucket
b. Scrub sensitive data from the files
c. Replace the scrubbed data with masked/synthetic data
d. Validate that the files do not contain any original data
e. Validate that the generated synthetic data matches the original schema
f. Validate that the metadata (number of rows, number of columns, etc.) matches
g. Generate statistics on the scrubbing operation
h. Copy the final data to an alternate S3 bucket
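The scrubbing and validation steps (b through f) above could be sketched roughly as follows. This is a minimal illustration only: the field names, masking scheme, and sample records are hypothetical, and the S3 read/copy steps (a and h) would be handled separately by a client such as boto3.

```python
import hashlib

# Hypothetical sensitive fields; in practice these would come from a
# scrubbing configuration supplied alongside the data model.
SENSITIVE_FIELDS = {"name", "email"}

def scrub_record(record):
    """Step b/c: replace sensitive values with deterministic masked tokens."""
    scrubbed = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            digest = hashlib.sha256(str(value).encode()).hexdigest()[:8]
            scrubbed[field] = f"MASKED_{digest}"
        else:
            scrubbed[field] = value
    return scrubbed

def validate(original_rows, scrubbed_rows):
    """Steps d-f: no original sensitive values remain, schema and row count match."""
    assert len(original_rows) == len(scrubbed_rows)  # metadata: row count
    for orig, scrub in zip(original_rows, scrubbed_rows):
        assert orig.keys() == scrub.keys()           # schema: same columns
        for field in SENSITIVE_FIELDS:
            assert scrub[field] != orig[field]       # no original data

# Hypothetical rows standing in for records read from the production bucket.
rows = [{"id": 1, "name": "Alice", "email": "a@example.com"},
        {"id": 2, "name": "Bob", "email": "b@example.com"}]
scrubbed = [scrub_record(r) for r in rows]
validate(rows, scrubbed)

# Step g: simple statistics on the scrubbing operation.
stats = {"rows_processed": len(rows),
         "values_masked": len(SENSITIVE_FIELDS) * len(rows)}
```

A production version would read/write Parquet or CSV objects from S3 rather than in-memory dicts, but the validation contract would be the same.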
Synthetic Data Generation
a. Read the data model from the user: schema, sample data, and custom field information specifying how the data should be generated
b. Validate that the files do not contain any original data
c. Validate that the generated synthetic data matches the provided data model schema
d. Validate that the metadata (number of rows, number of columns, etc.) matches
e. Generate statistics on the data generation
f. Copy the final data to the S3 bucket
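The generation and validation steps above could be sketched as follows. The data model structure, field names, and per-field rules are hypothetical stand-ins for whatever schema and custom field information the user supplies in step a; the S3 copy in step f would again use a client such as boto3.

```python
import random
import string

# Hypothetical user-supplied data model: a schema plus a generation rule
# per field (step a). Each rule takes the row index and returns a value.
data_model = {
    "schema": ["id", "name", "amount"],
    "rules": {
        "id": lambda i: i + 1,
        "name": lambda i: "user_" + "".join(
            random.choices(string.ascii_lowercase, k=5)),
        "amount": lambda i: round(random.uniform(1.0, 100.0), 2),
    },
}

def generate_rows(model, n):
    """Generate n synthetic rows according to the model's per-field rules."""
    return [{field: model["rules"][field](i) for field in model["schema"]}
            for i in range(n)]

rows = generate_rows(data_model, 10)

# Steps c-d: generated data matches the model schema and expected row count.
assert all(set(r) == set(data_model["schema"]) for r in rows)
assert len(rows) == 10

# Step e: simple statistics on the generation run.
stats = {"rows_generated": len(rows),
         "columns": len(data_model["schema"])}
```

Since the rules contain no original values, step b (no original data present) is satisfied by construction in this sketch; a real pipeline would still verify it explicitly against the source data.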
Please find below our queries/concerns:
1. What types of files does it support?
2. Does it support both scrubbing and synthetic data generation?
3. Does it support validation, and if so, what kinds of validation?
4. Does it support AWS S3 connectivity?
5. What kinds of algorithms does it use?
6. Does it support Snowflake and Redshift connectivity?
7. What is the maximum file size it supports? We have a requirement of around 100 GB.
8. Does it support PII data in Parquet-formatted files?
Please share this information with us.