1. What is data integration?
Answer:
Data integration is the process of collecting and combining data from different sources and bringing it together in a unified way so it can be analyzed, reported on, or used for decision-making.
Think of it like gathering pieces of a puzzle from different boxes and putting them together to see the full picture.
2. What are common issues in data integration?
Answer:
- Schema Integration: Merging metadata from different sources
- Entity Identification: Matching real-world entities (e.g., A.cust-id ≡ B.cust-#)
- Data Value Conflicts: Different units/scales (e.g., km vs miles)
- Redundant Data: Same attributes with different names
- Inconsistencies: Conflicting or duplicated information
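As a rough illustration of entity identification and unit conflicts, here is a minimal Python sketch. The records, the field names (`cust_id`, `cust_no`, `distance_km`, `distance_miles`), and the `integrate` helper are all made up for this example; `cust_id` and `cust_no` stand in for A.cust-id and B.cust-#.

```python
# Hypothetical records from two sources; every field name here is illustrative.
source_a = [
    {"cust_id": 1, "name": "Ali", "distance_km": 10.0},
    {"cust_id": 2, "name": "Sara", "distance_km": 3.2},
]
source_b = [
    {"cust_no": 1, "distance_miles": 5.0},
    {"cust_no": 3, "distance_miles": 1.0},
]

KM_PER_MILE = 1.609344


def integrate(a_rows, b_rows):
    """Merge both sources on the customer key and report all distances in km."""
    merged = {}
    for row in a_rows:
        merged[row["cust_id"]] = {
            "cust_id": row["cust_id"],
            "name": row["name"],
            "distances_km": [row["distance_km"]],
        }
    for row in b_rows:
        # Entity identification: B's cust_no refers to the same entity as A's cust_id.
        key = row["cust_no"]
        entry = merged.setdefault(
            key, {"cust_id": key, "name": None, "distances_km": []}
        )
        # Data value conflict: convert miles to km before combining.
        entry["distances_km"].append(row["distance_miles"] * KM_PER_MILE)
    return [merged[key] for key in sorted(merged)]


unified = integrate(source_a, source_b)
```

Customer 1 appears in both sources and ends up as a single record with both distances in kilometres; customer 3 exists only in the second source, so its name stays unknown (`None`).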
4. What is data transformation?
Answer:
Data transformation is the process of converting data from its original (raw) format into a format that is clean, consistent, and ready for analysis, mining, or storage. It often involves changing the structure, format, or values of the data.
5. What are the key data transformation techniques?
Answer:
- Smoothing: Removes noise (e.g., using binning, regression, clustering)
- Aggregation: Summarization (e.g., constructing data cubes)
- Generalization: Replacing low-level data with higher-level concepts
- Attribute Construction: Creating new features/attributes
- Normalization: Scaling data within a specific range
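Three of these techniques can be sketched in a few lines of Python. This is only an illustration under assumptions: smoothing is done by equal-frequency bin means (one of several binning variants), and the function names, the `(date, amount)` sales format, and the `width`/`height` fields are hypothetical.

```python
from collections import defaultdict


def smooth_by_bin_means(values, bin_size):
    """Smoothing: sort, split into bins of bin_size, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


def aggregate_by_month(daily_sales):
    """Aggregation: sum (date, amount) pairs per month; dates are 'YYYY-MM-DD' strings."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount  # the first 7 characters are 'YYYY-MM'
    return dict(totals)


def add_area(rows):
    """Attribute construction: derive a new 'area' attribute from width and height."""
    return [{**row, "area": row["width"] * row["height"]} for row in rows]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```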
6. Why is data normalization important?
Answer:
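- Speeds up processing
- Reduces memory usage
- Ensures fair contribution of attributes during mining, so an attribute with a large range (e.g., income) does not outweigh one with a small range (e.g., age)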
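To see the "fair contribution" point concretely, the sketch below compares the Euclidean distance between two people before and after min-max scaling. The two records and the assumed attribute ranges (age 20 to 70, income 20,000 to 120,000) are invented for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def min_max(value, lo, hi):
    """Rescale value from the range [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


# Hypothetical (age, income) records and assumed attribute ranges.
person_1 = (25, 50_000)
person_2 = (60, 51_000)
AGE_RANGE = (20, 70)
INCOME_RANGE = (20_000, 120_000)


def scale(person):
    age, income = person
    return (min_max(age, *AGE_RANGE), min_max(income, *INCOME_RANGE))


raw_distance = euclidean(person_1, person_2)
scaled_distance = euclidean(scale(person_1), scale(person_2))

print(f"raw: {raw_distance:.1f}, scaled: {scaled_distance:.3f}")
```

On the raw values the distance is driven almost entirely by the 1,000 difference in income, even though the two people are 35 years apart in age. After scaling, the age difference (0.7 of its range) dominates the small income difference (0.01 of its range).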
7. What are the main normalization techniques?
| Technique | Description | Example |
|---|---|---|
| Min-Max | Scales data linearly to a specified range | Values in 5-100 → rescaled to 0-1 |
| Z-score | Uses mean and standard deviation; less sensitive to outliers than min-max | (x - mean) / std |
| Decimal Scaling | Divides by 10^n, where n is the smallest integer that makes the max absolute value less than 1 | -500 → -0.5 (n = 3) |
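The three techniques can be written out as a minimal Python sketch. The function names are hypothetical, and the z-score version assumes the population standard deviation (`statistics.pstdev`); some texts use the sample standard deviation instead.

```python
import statistics


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("min-max normalization needs at least two distinct values")
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


def z_score_normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("z-score normalization needs a non-zero standard deviation")
    return [(v - mean) / std for v in values]


def decimal_scaling_normalize(values):
    """Divide by 10^n, with n the smallest integer such that max(|v|) / 10^n < 1."""
    max_abs = max(abs(v) for v in values)
    n = 0
    while max_abs / (10 ** n) >= 1:
        n += 1
    return [v / (10 ** n) for v in values]


print(min_max_normalize([5, 52.5, 100]))          # [0.0, 0.5, 1.0]
print(decimal_scaling_normalize([-500, 45, 120]))  # [-0.5, 0.045, 0.12]
```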
8. Give an example of generalization.
Answer:
Age: 20-25 → Age1, 26-30 → Age2, etc.
Street: SP PJ 1 → SP, LHR 2 → PUNJAB, etc.
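The age example can be sketched as a small lookup function in Python. Only the two groups given above come from the example; the function name and the `"Other"` fallback for ages outside those groups are assumptions made for illustration.

```python
def generalize_age(age):
    """Replace a concrete age with a higher-level group label."""
    if 20 <= age <= 25:
        return "Age1"
    if 26 <= age <= 30:
        return "Age2"
    return "Other"  # hypothetical fallback; further groups are not specified here


ages = [22, 27, 25, 30]
print([generalize_age(a) for a in ages])  # ['Age1', 'Age2', 'Age1', 'Age2']
```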