Important Questions & Answers from CH2: Data Cleaning
1. Why is data preprocessing important?
Answer:
Real-world data is often incomplete, noisy, and inconsistent, and poor-quality data can lead to misleading results. Data preprocessing ensures that the data is clean, consistent, and ready for mining. It is commonly estimated to account for about 80% of the effort in the KDD (Knowledge Discovery in Databases) process.
2. What are the major tasks in data preprocessing?
Answer:
- Data Cleaning (handle missing values, noise, and outliers)
- Data Integration (combine data from multiple sources)
- Data Reduction (reduce volume while maintaining accuracy)
- Data Transformation (normalize, aggregate, discretize data)
3. What makes real-world data "dirty"?
Answer:
- Incomplete: Missing values
- Noisy: Errors or outliers (e.g., negative salary)
- Inconsistent: Conflicting entries or formats (e.g., different rating scales or DOB mismatches)
4. How can we handle missing data?
Answer:
- Ignore the record (not recommended)
- Fill with:
  - Mean (numerical data)
  - Mode (categorical data)
  - Median (skewed data)
- Interpolate (for grouped/ordered data)
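The fill strategies above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a definitive implementation: the function names `fill_missing` and `interpolate_linear` are hypothetical, missing values are assumed to be represented as `None`, and linear interpolation is just one possible way to interpolate ordered data.

```python
from statistics import mean, median, mode

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fillers = {"mean": mean, "median": median, "mode": mode}
    fill_value = fillers[strategy](observed)
    return [fill_value if v is None else v for v in values]

def interpolate_linear(values):
    """Fill interior None gaps in an ordered series by linear interpolation.

    Gaps at the very start or end have no neighbour on one side,
    so they are left as None.
    """
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result

ages = [20, None, 30, 100]                  # skewed by the value 100
print(fill_missing(ages, "mean"))           # [20, 50, 30, 100]
print(fill_missing(ages, "median"))         # [20, 30, 30, 100]

colors = ["red", None, "blue", "red"]       # categorical data
print(fill_missing(colors, "mode"))         # ['red', 'red', 'blue', 'red']

temps = [10.0, None, None, 16.0]            # ordered readings
print(interpolate_linear(temps))            # [10.0, 12.0, 14.0, 16.0]
```

Note how the mean (50) is pulled up by the single large value, while the median (30) is not. That is why the median is preferred for skewed data.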
5. What is noisy data and how do we handle it?
Answer:
Noisy data contains random error or variance in a measured variable, typically caused by faulty instruments or data-entry errors.
It can be handled using:
- Binning methods (smoothing by bin mean, median, or boundaries)
- Clustering (to find and remove outliers)
- Manual checking
- Discretization techniques
6. What is binning in data cleaning?
Answer:
Binning in data cleaning is a way to make messy or noisy numbers easier to understand by grouping them into smaller ranges, called bins.
Think of it like sorting students' test scores into grade ranges:
- 0–50
- 51–70
- 71–100
Instead of looking at every single score, we just look at which range (or bin) they fall into.
There are a few ways to clean up the data once it's in bins:
- By mean – Replace all the values in the bin with the average of that bin.
  Example: If a bin has 60, 65, and 70, replace all with the average: 65.
- By median – Replace all the values with the middle value in the bin.
  Example: For 60, 65, 70 → the median is 65, so all become 65.
- By boundaries – Replace each number with the closest end of the bin.
  Example: If a bin is 60–70:
  - 61 becomes 60 (closer to 60)
  - 69 becomes 70 (closer to 70)
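The three smoothing methods can be sketched as one small Python function. This is only an illustration under a few assumptions: the name `smooth_bin` is hypothetical, the values have already been grouped into a bin, the bin boundaries are taken to be the smallest and largest values in the bin, and ties go to the lower boundary.

```python
from statistics import mean, median

def smooth_bin(values, method="mean"):
    """Smooth the values of a single bin by its mean, median, or boundaries."""
    if method == "mean":
        return [mean(values)] * len(values)
    if method == "median":
        return [median(values)] * len(values)
    if method == "boundaries":
        low, high = min(values), max(values)
        # Each value moves to the nearer boundary; a tie goes to the lower one
        # (an arbitrary choice made for this sketch).
        return [low if v - low <= high - v else high for v in values]
    raise ValueError(f"unknown method: {method}")

scores = [60, 61, 69, 70]                    # one bin covering 60-70
print(smooth_bin(scores, "mean"))            # every value becomes 65
print(smooth_bin(scores, "median"))          # every value becomes 65.0
print(smooth_bin(scores, "boundaries"))      # [60, 60, 70, 70]
```

In practice the data is first sorted and split into bins, and the function is then applied to each bin in turn.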
7. What is discretization in data preprocessing?
Answer:
Discretization means turning continuous data (like height, weight, or temperature, which can have many possible values) into smaller groups or categories, called buckets or bins.
There are two common ways to do this:
- Equal-width discretization:
  - The entire range of values is divided into intervals that are all the same size.
  - Example: If values go from 0 to 100 and we want 5 bins, we get:
    0–20, 21–40, 41–60, 61–80, 81–100.
- Equal-frequency discretization:
  - Each bin has the same number of data points, even if the intervals are different in size.
  - Example: If we have 100 values and want 4 bins, each bin will have 25 values, no matter how wide the ranges are.
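Both approaches can be sketched in plain Python. The helper names below (`equal_width_edges`, `assign_bin`, `equal_frequency_bins`) are hypothetical, and the sketch assumes half-open bins, so a value sitting exactly on an edge (such as 20) goes into the higher bin, which differs slightly from the 0–20, 21–40 labelling used in the example above.

```python
def equal_width_edges(values, k):
    """Return the k + 1 edges of k equal-width bins spanning min..max."""
    low, high = min(values), max(values)
    width = (high - low) / k
    return [low + i * width for i in range(k + 1)]

def assign_bin(value, edges):
    """Return the 0-based index of the equal-width bin that contains value.

    Bins are half-open [low, high), except the last one, which also
    includes the maximum value.
    """
    k = len(edges) - 1
    width = edges[1] - edges[0]
    index = int((value - edges[0]) // width)
    return min(index, k - 1)

def equal_frequency_bins(values, k):
    """Split the sorted values into k bins holding roughly the same count."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [round(i * n / k) for i in range(k + 1)]
    return [ordered[cuts[i]:cuts[i + 1]] for i in range(k)]

edges = equal_width_edges([0, 37, 100], 5)
print(edges)                      # [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]
print(assign_bin(35, edges))      # 1, i.e. the second bin (20 to 40)

bins = equal_frequency_bins(range(100), 4)
print([len(b) for b in bins])     # [25, 25, 25, 25]
```

Equal-width bins are simple but can end up nearly empty when the data is skewed. Equal-frequency bins avoid that at the cost of uneven interval sizes.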
8. What is an outlier? How is it detected?
Answer:
An outlier is a value that significantly differs from other observations.
Detected using:
- Box Plot/Whisker analysis
  - Compute Q1 (25th percentile) and Q3 (75th percentile)
  - IQR = Q3 - Q1
  - A value x is an outlier if x < Q1 - 1.5×IQR or x > Q3 + 1.5×IQR
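The IQR rule above can be sketched with Python's standard `statistics` module. The function name `iqr_outliers` and the sample data are made up for illustration. Quartiles can be computed under several conventions, and this sketch assumes the "inclusive" method of `statistics.quantiles`, so other tools may give slightly different fences.

```python
from statistics import quantiles

def iqr_outliers(data):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return lower, upper, outliers

salaries = [10, 12, 13, 14, 15, 16, 18, 95]   # hypothetical values
lower, upper, outliers = iqr_outliers(salaries)
print(lower, upper)    # 7.125 22.125
print(outliers)        # [95]
```

Here Q1 = 12.75 and Q3 = 16.5, so IQR = 3.75 and any value below 7.125 or above 22.125 is flagged.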
9. What is the five-number summary in box plot analysis?
Answer:
The five-number summary is a quick way to describe a set of numbers. It includes:
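- Minimum – The smallest value (in a box plot, the lower whisker stops at the smallest value that is not an outlier)
- Q1 (1st quartile) – The value at 25% of the data
- Median (Q2) – The middle value (50% point)
- Q3 (3rd quartile) – The value at 75% of the data
- Maximum – The largest value (in a box plot, the upper whisker stops at the largest value that is not an outlier)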
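As a rough sketch, the five-number summary can be computed with the standard library as follows. The function name `five_number_summary` and its `exclude_outliers` flag are hypothetical. The flag is there because the minimum and maximum can be taken over all values or, as in a box plot's whiskers, only over the values that are not outliers. Quartiles again assume the "inclusive" method of `statistics.quantiles`.

```python
from statistics import quantiles

def five_number_summary(data, exclude_outliers=False):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    kept = list(data)
    if exclude_outliers:
        # Keep only values inside the 1.5 x IQR fences before taking min/max.
        iqr = q3 - q1
        kept = [x for x in data if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
    return min(kept), q1, q2, q3, max(kept)

values = [10, 12, 13, 14, 15, 16, 18, 95]     # hypothetical values
print(five_number_summary(values))
# (10, 12.75, 14.5, 16.5, 95)
print(five_number_summary(values, exclude_outliers=True))
# (10, 12.75, 14.5, 16.5, 18)
```

The quartiles are the same in both calls. Only the maximum changes, from the outlier 95 to 18, the largest value inside the fences.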
10. What are the key statistics used in data cleaning?
Answer:
These are the main numbers we use to understand and clean data:
- Mean – The average of all values (good for general trends, but sensitive to outliers)
- Median – The middle value (better when data has outliers)
- Mode – The most common value in the data
- Variance – Shows how spread out the data is
- Standard Deviation – Also shows spread, but in the same units as the data
- IQR (Interquartile Range) – Helps find outliers by measuring the range of the middle 50% of values
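These statistics are all available in Python's standard `statistics` module, as the short sketch below shows. The helper name `describe` and the sample data are made up for illustration. The sketch uses the population variance and standard deviation (`pvariance`, `pstdev`); use `variance` and `stdev` instead if the data is a sample.

```python
from statistics import mean, median, mode, pvariance, pstdev, quantiles

def describe(data):
    """Collect the key cleaning statistics for a list of numbers."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "mean": mean(data),
        "median": median(data),
        "mode": mode(data),
        "variance": pvariance(data),   # population variance (an assumption)
        "std_dev": pstdev(data),       # same units as the data
        "iqr": q3 - q1,
    }

data = [2, 4, 4, 4, 5, 5, 7, 9]
stats = describe(data)
print(stats["mean"], stats["median"], stats["mode"])   # 5 4.5 4
print(stats["variance"], stats["std_dev"])             # 4 2.0

# Adding one extreme value moves the mean far more than the median.
with_outlier = data + [100]
print(mean(with_outlier) > 15, median(with_outlier))   # True 5
```

The last two lines illustrate why the median is the safer measure of centre when outliers are present.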