Write a Data Cleaning Script
Generate Python/SQL scripts to clean messy data — handle nulls, standardize formats, remove duplicates.
The Prompt
Write a data cleaning script for the following dataset.

Handle:
1. Missing values: identify columns and appropriate fill strategies (mean/median/mode/flag/drop)
2. Duplicates: detection and removal strategy
3. Inconsistent formatting: standardize dates, phone numbers, emails, categorical values
4. Outlier handling: identify and flag/cap/remove based on context
5. Data type corrections
6. New derived columns that would be useful
7. Validation checks after cleaning

Language: [PYTHON (pandas) / SQL]

Dataset description: [DESCRIBE YOUR DATASET: column names, types, known issues]

Sample dirty data:
```
[PASTE A FEW ROWS SHOWING THE ISSUES]
```
Example Output
A Python script using pandas that fills missing revenue with 0 for inactive accounts and with the median for active ones, standardizes all phone numbers to E.164 format using regex, drops duplicates while keeping the most recent row, caps revenue outliers at the 99th percentile, and adds a data_quality_score column for downstream use.
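For a sense of what such a script might look like, here is a minimal sketch in pandas. The column names (`account_id`, `status`, `revenue`, `phone`, `updated_at`), the `to_e164` and `clean_accounts` helpers, the sample rows, and the scoring formula are all hypothetical and chosen only for illustration. It also assumes that 10-digit phone numbers are US numbers (country code 1); a real script would need rules that match your data.

```python
import re

import pandas as pd


def to_e164(raw, default_country_code="1"):
    """Return the phone number in E.164 form, or None if it cannot be parsed.

    Assumption: a bare 10-digit number belongs to default_country_code.
    """
    if pd.isna(raw):
        return None
    text = str(raw).strip()
    digits = re.sub(r"\D", "", text)  # keep digits only
    if text.startswith("+"):
        pass  # already carries a country code
    elif len(digits) == 10:
        digits = default_country_code + digits
    elif not (len(digits) == 11 and digits.startswith(default_country_code)):
        return None
    # E.164: at most 15 digits, and the first digit is never 0
    return "+" + digits if re.fullmatch(r"[1-9]\d{1,14}", digits) else None


def clean_accounts(df):
    out = df.copy()

    # Data type corrections: unparseable values become NaT / NaN
    out["updated_at"] = pd.to_datetime(out["updated_at"], errors="coerce")
    out["revenue"] = pd.to_numeric(out["revenue"], errors="coerce")

    # Duplicates: keep the most recent row per account
    out = (
        out.sort_values("updated_at", na_position="first")
        .drop_duplicates(subset="account_id", keep="last")
        .reset_index(drop=True)
    )

    # Missing revenue: 0 for inactive accounts, median of active ones otherwise
    revenue_missing = out["revenue"].isna()
    active = out["status"].str.lower().eq("active")
    active_median = out.loc[active, "revenue"].median()
    out.loc[revenue_missing & ~active, "revenue"] = 0.0
    out.loc[revenue_missing & active, "revenue"] = active_median

    # Formatting: phone numbers to E.164, unparseable ones to None
    out["phone"] = out["phone"].map(to_e164)

    # Outliers: cap revenue at the 99th percentile
    out["revenue"] = out["revenue"].clip(upper=out["revenue"].quantile(0.99))

    # Derived column (illustrative formula): share of the two checked
    # fields that were originally present and valid
    problems = revenue_missing.astype(float) + out["phone"].isna().astype(float)
    out["data_quality_score"] = 1.0 - problems / 2
    return out


# Made-up sample rows showing a duplicate, missing revenue, and mixed phone formats
raw = pd.DataFrame(
    {
        "account_id": [1, 1, 2, 3],
        "status": ["active", "active", "inactive", "Active"],
        "revenue": [100.0, 120.0, None, None],
        "phone": ["(415) 555-0100", "415-555-0100", "not a number", "+44 20 7946 0958"],
        "updated_at": ["2024-01-01", "2024-02-01", "2024-01-15", "2024-01-20"],
    }
)
cleaned = clean_accounts(raw)
print(cleaned)
```

On a sample this small the 99th-percentile cap changes nothing; it only has a visible effect on larger datasets with a long tail. Recording which revenue values were missing before filling them is a deliberate choice, so the quality score reflects the original data rather than the cleaned result.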
FAQ
Which AI model is best for writing a data cleaning script?
GPT-4o or Claude Sonnet 4 — both write solid pandas/SQL data cleaning code.
How do I use the "Write a Data Cleaning Script" prompt?
Copy the prompt, replace the [BRACKETED] placeholders with your specific information, and paste it into your preferred AI assistant (ChatGPT, Claude, Gemini, etc.).
Model Recommendation
GPT-4o or Claude Sonnet 4 — both write solid pandas/SQL data cleaning code.