
Ch#4

1. What is data reduction?

Answer:
Data reduction is the process of reducing the volume of data while producing the same or similar analytical results.
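As a concrete illustration (not from the chapter itself), dimensionality reduction with PCA shows the idea: the dataset shrinks from 4 columns to 2 while most of its variance, and hence most of its analytical value, is retained. This sketch uses scikit-learn and the Iris dataset purely as an example.

```python
# Illustrative sketch of data reduction via PCA (example only; the chapter
# itself discusses WEKA-based feature selection, not PCA).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                 # 150 samples x 4 features
pca = PCA(n_components=2)            # reduce to 2 dimensions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                          # (150, 2)
print(pca.explained_variance_ratio_.sum())      # fraction of variance retained
```

Even after dropping half the columns, the two principal components keep well over 90% of the original variance, so downstream analysis gives similar results on far less data.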


2. Why do we need data reduction?

Answer:

  • Improves model performance (speed & accuracy)

  • Helps in data visualization

  • Reduces dimensionality

  • Removes noise

  • Leads to simpler, faster, and more accurate models


3. What is feature selection (aka attribute/variable selection)?

Answer:
It’s the process of selecting an optimal subset of features from the data that contribute most to the model, based on a specific evaluation criterion.


5. What are the main techniques for feature selection (data reduction)?

  Method          | Description                                                      | Tools/Details
  Wrapper Method  | Uses a classifier to evaluate feature subsets by performance     | Generates candidate subsets; uses a search technique to find the best one
  Filter Method   | Ranks features with an attribute evaluator and keeps the top ones | Does not rely on a classifier; WEKA example: InfoGainAttributeEval + Ranker
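A minimal sketch of the filter method, using scikit-learn as a stand-in for WEKA's InfoGainAttributeEval + Ranker: features are scored with mutual information (an information-gain-style measure) and the top k are kept, with no classifier involved. The dataset and k value are illustrative.

```python
# Filter method sketch: rank features by mutual information and keep the top k.
# Mirrors WEKA's InfoGainAttributeEval + Ranker, here via scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_top = selector.fit_transform(X, y)

print(selector.scores_)   # one relevance score per feature
print(X_top.shape)        # (150, 2): only the two top-ranked features remain
```

Because the scores come from a statistical measure rather than a trained model, this runs fast, but it judges each feature independently.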

6. Difference between Wrapper and Filter methods?

  Feature       | Wrapper Method                     | Filter Method
  Based on      | Classifier performance             | Statistical evaluation
  Speed         | Slower (computationally expensive) | Faster
  Accuracy      | Generally more accurate            | May not consider interaction between features
  Tool example  | Classifier + subset evaluator      | InfoGain + Ranker in WEKA
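The wrapper side of the comparison can be sketched in the same way. Here a classifier (KNN, chosen only for illustration) scores candidate feature subsets via cross-validation, and a greedy forward search picks the best-performing subset; this is what makes wrappers slower but often more accurate than filters.

```python
# Wrapper method sketch: a classifier evaluates candidate feature subsets,
# and a greedy forward search selects the subset that performs best.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward")
sfs.fit(X, y)                     # trains the classifier many times internally

print(sfs.get_support())          # boolean mask of the selected features
print(sfs.transform(X).shape)     # (150, 2)
```

Note the cost difference: the filter example scores each feature once, while this wrapper retrains the classifier for every candidate subset at every search step.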
