
Chapter 2

Important Questions & Answers from Chapter 2: Data Cleaning


1. Why is data preprocessing important?

Answer:
Real-world data is often incomplete, noisy, and inconsistent, and poor-quality data leads to misleading mining results. Data preprocessing ensures that the data is clean, consistent, and ready for mining; it is commonly estimated to account for around 80% of the effort in the KDD (Knowledge Discovery in Databases) process.


2. What are the major tasks in data preprocessing?

Answer:

  • Data Cleaning (handle missing, noisy, outlier data)

  • Data Integration (combine data from multiple sources)

  • Data Reduction (reduce volume while maintaining accuracy)

  • Data Transformation (normalize, aggregate, discretize data)


3. What makes real-world data "dirty"?

Answer:

  • Incomplete: Missing values

  • Noisy: Errors or outliers (e.g., negative salary)

  • Inconsistent: Conflicting entries or formats (e.g., different rating scales or DOB mismatches)


4. How can we handle missing data?

Answer:

  • Ignore the record (not recommended)

  • Fill with:

    • Mean (numerical data)

    • Mode (categorical data)

    • Median (skewed data)

  • Interpolate (for grouped/ordered data)
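The fill strategies above can be sketched with Python's standard library. The salary list here is hypothetical sample data, with `None` marking a missing entry:

```python
from statistics import mean, median

# Hypothetical sample: None marks a missing salary entry.
salaries = [30000, 32000, None, 35000, None, 31000]

observed = [v for v in salaries if v is not None]

# Fill missing values with the mean of the observed values.
filled_mean = [v if v is not None else mean(observed) for v in salaries]

# For skewed data, the median is a more robust fill value.
filled_median = [v if v is not None else median(observed) for v in salaries]

print(filled_mean)
print(filled_median)
```

For categorical attributes, the same pattern applies with `statistics.mode` in place of the mean.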


5. What is noisy data and how do we handle it?

Answer:
Noisy data contains random errors or variance in a measured variable, typically caused by faulty instruments or data-entry mistakes.
It can be handled using:

  • Binning methods (mean, median, boundaries)

  • Clustering (to find and remove outliers)

  • Manual checking

  • Discretization techniques


6. What is binning in data cleaning?

Answer:

Binning in data cleaning is a way to smooth messy or noisy numbers by sorting them and grouping them into smaller ranges, called bins.

Think of it like sorting students' test scores into grade ranges:

  • 0–50

  • 51–70

  • 71–100

Instead of looking at every single score, we just look at which range (or bin) they fall into.

There are a few ways to clean up the data once it's in bins:

  1. By mean – Replace all the values in the bin with the average of that bin.
    Example: If a bin has 60, 65, and 70, replace all with the average: 65.

  2. By median – Replace all the values with the middle value in the bin.
    Example: For 60, 65, 70 → the median is 65, so all become 65.

  3. By boundaries – Replace each number with the closest end (boundary) of the bin.
    Example: If a bin is 60–70:

    • 61 becomes 60 (closer to 60)

    • 69 becomes 70 (closer to 70)
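The three smoothing methods can be sketched in a few lines of Python. The values and the bin size of 3 are hypothetical choices for illustration:

```python
from statistics import mean

# Hypothetical sorted values split into equal-frequency bins of size 3.
values = sorted([4, 8, 9, 15, 21, 21, 24, 25, 29])
bins = [values[i:i + 3] for i in range(0, len(values), 3)]

# Smoothing by bin means: every value becomes its bin's average.
by_means = [[mean(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer bin edge.
by_boundaries = [
    [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
    for b in bins
]

print(by_means)
print(by_boundaries)
```

Smoothing by median works the same way as smoothing by mean, with `statistics.median` substituted for `mean`.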

7. What is discretization in data preprocessing?

Answer:

Discretization means turning continuous data (like height, weight, or temperature, which can have many possible values) into smaller groups or categories, called buckets or bins.

There are two common ways to do this:

  1. Equal-width discretization:

    • The entire range of values is divided into intervals that are all the same size.

    • Example: If values go from 0 to 100 and we want 5 bins, we get:
      0–20, 21–40, 41–60, 61–80, 81–100.

  2. Equal-frequency discretization:

    • Each bin has the same number of data points, even if the intervals are different in size.

    • Example: If we have 100 values and want 4 bins, each bin will have 25 values, no matter how wide the ranges are.
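Both discretization schemes can be sketched without any libraries. The data and the choice of 4 bins are hypothetical; each value is mapped to a bin index from 0 to 3:

```python
# Hypothetical scores to discretize into 4 bins.
data = sorted([2, 5, 7, 11, 14, 20, 23, 41])

# Equal-width: split the value range [min, max] into 4 same-size intervals.
lo, hi = data[0], data[-1]
width = (hi - lo) / 4
eq_width = [min(int((v - lo) / width), 3) for v in data]  # bin index 0..3

# Equal-frequency: each bin gets the same number of points (2 here).
eq_freq = [i // 2 for i in range(len(data))]

print(eq_width)
print(eq_freq)
```

Note how the outlier 41 stretches the equal-width bins, leaving the first interval crowded, while equal-frequency bins stay balanced by construction.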


8. What is an outlier? How is it detected?

Answer:
An outlier is a value that significantly differs from other observations.
Detected using:

  • Box Plot/Whisker analysis

    • Compute Q1 (25th percentile), Q3 (75th percentile)

    • IQR = Q3 - Q1

    • Outlier if data < Q1 - 1.5×IQR or > Q3 + 1.5×IQR
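The IQR rule above can be applied directly with Python's standard library. The data is a hypothetical salary list with one obvious outlier; note that `statistics.quantiles` uses the "exclusive" interpolation method by default, so other tools may give slightly different quartiles:

```python
from statistics import quantiles

# Hypothetical salaries (in thousands) with one obvious outlier.
data = [40, 42, 45, 47, 50, 52, 55, 58, 200]

# quantiles(..., n=4) returns the three cut points [Q1, median, Q3].
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag any value outside the whiskers as an outlier.
outliers = [v for v in data if v < low or v > high]
print(outliers)
```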


9. What is the five-number summary in box plot analysis?

Answer:

The five-number summary is a quick way to describe a set of numbers. It includes:

  1. Minimum – The smallest value (excluding outliers)

  2. Q1 (1st quartile) – The value at 25% of the data

  3. Median (Q2) – The middle value (50% point)

  4. Q3 (3rd quartile) – The value at 75% of the data

  5. Maximum – The largest value (excluding outliers)

10. What are the key statistics used in data cleaning?

Answer:

These are the main numbers we use to understand and clean data:

  1. Mean – The average of all values (good for general trends, but sensitive to outliers)

  2. Median – The middle value (better when data has outliers)

  3. Mode – The most common value in the data

  4. Variance – Shows how spread out the data is

  5. Standard Deviation – Also shows spread, but in the same units as the data

  6. IQR (Interquartile Range) – Helps find outliers by measuring the range of the middle 50% of values
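All of these statistics are available in Python's standard `statistics` module. A quick sketch on a hypothetical set of exam scores, chosen to show how the outlier pulls the mean above the median:

```python
from statistics import mean, median, mode, pvariance, pstdev, quantiles

# Hypothetical exam scores, including one high outlier (98).
scores = [55, 60, 60, 65, 70, 75, 98]

print(mean(scores))    # average: pulled up by the outlier
print(median(scores))  # middle value: robust to the outlier
print(mode(scores))    # most frequent value

print(pvariance(scores))  # spread, in squared units
print(pstdev(scores))     # spread, in the same units as the data

q1, _, q3 = quantiles(scores, n=4)
print(q3 - q1)            # IQR: range of the middle 50% of values
```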
