
Ch#3: Data Integration & Data Transformation

1. What is data integration?

Answer:

Data integration is the process of collecting and combining data from different sources and bringing it together in a unified way so it can be analyzed, reported on, or used for decision-making.

Think of it like gathering pieces of a puzzle from different boxes and putting them together to see the full picture.
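The idea can be sketched in a few lines of Python. This is a minimal illustration, not a real integration pipeline; the source names and records are invented for the example.

```python
# Two hypothetical data sources keyed by customer id (illustrative data).
sales_db = {101: {"name": "Ali", "city": "LHR"},
            102: {"name": "Sara", "city": "KHI"}}
support_db = {101: {"tickets": 3}, 103: {"tickets": 1}}

def integrate(*sources):
    """Combine records from several sources into one unified view per key."""
    unified = {}
    for source in sources:
        for key, record in source.items():
            unified.setdefault(key, {}).update(record)
    return unified

customers = integrate(sales_db, support_db)
print(customers[101])  # {'name': 'Ali', 'city': 'LHR', 'tickets': 3}
```

Customer 101 appears in both sources, so the unified view merges the attributes from each; customers seen in only one source are kept as-is.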

2. What are common issues in data integration?

Answer:

  • Schema Integration: Merging metadata from different sources

  • Entity Identification: Matching real-world entities (e.g., A.cust-id ≡ B.cust-#)

  • Data Value Conflicts: Different units/scales (e.g., km vs miles)

  • Redundant Data: Same attributes with different names.

  • Inconsistencies: Conflicting or duplicated information
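Two of these issues, entity identification and data value conflicts, can be shown concretely. In this hedged sketch (all field names and records are invented), source A keys customers by "cust_id" and stores distance in km, while source B uses "cust_no" and miles for the same entities:

```python
MILES_TO_KM = 1.60934

source_a = [{"cust_id": 1, "dist_km": 12.0}]
source_b = [{"cust_no": 2, "dist_miles": 10.0}]

def normalize_b(record):
    """Map B's schema onto A's: rename the key attribute, convert units."""
    return {"cust_id": record["cust_no"],
            "dist_km": round(record["dist_miles"] * MILES_TO_KM, 2)}

merged = source_a + [normalize_b(r) for r in source_b]
print(merged[1])  # {'cust_id': 2, 'dist_km': 16.09}
```

Resolving the schema mismatch (cust_no → cust_id) and the unit conflict (miles → km) before merging is exactly the schema integration and value-conflict work listed above.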

4. What is data transformation?

Answer:

Data transformation is the process of converting data from its original (raw) format into a format that is clean, consistent, and ready for analysis, mining, or storage. It often involves changing the structure, format, or values of the data.

5. What are the key data transformation techniques?

Answer:

  • Smoothing: Removes noise (e.g., using binning, regression, clustering)

  • Aggregation: Summarization (e.g., constructing data cubes)

  • Generalization: Replacing low-level data with higher-level concepts

  • Attribute Construction: Creating new features/attributes

  • Normalization: Scaling data within a specific range
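The first technique, smoothing by binning, can be sketched as follows (the price list is a textbook-style example chosen for illustration): sort the values, split them into equal-frequency bins, and replace each value by its bin's mean.

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning: replace each value by its bin's mean."""
    smoothed = []
    for i in range(0, len(values), bin_size):
        bin_vals = values[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Each group of three values collapses to its mean, which removes small fluctuations (noise) while preserving the overall trend.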


6. Why is data normalization important?

Answer:

  • Speeds up processing

  • Reduces memory usage

  • Ensures fair contribution of attributes during mining
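The "fair contribution" point is easiest to see with a distance calculation. In this sketch (the ages, salaries, and scaling ranges are invented), salary's large range swamps age in the raw Euclidean distance; after min-max scaling, both attributes contribute:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (age, salary) for two people
p1, p2 = (25, 50000), (45, 51000)
print(euclidean(p1, p2))  # ~1000.2 -- the salary gap dominates

def min_max(value, lo, hi):
    return (value - lo) / (hi - lo)

# Scale age into [0, 1] over 20-60, salary over 30000-90000
q1 = (min_max(25, 20, 60), min_max(50000, 30000, 90000))
q2 = (min_max(45, 20, 60), min_max(51000, 30000, 90000))
print(round(euclidean(q1, q2), 3))  # 0.5 -- age now matters too
```

Without normalization the 20-year age difference is invisible next to the 1000-unit salary difference; after scaling, the age difference drives the distance.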


7. What are the main normalization techniques?

Answer:

  • Min-Max: Scales data to a specified new range using x' = (x - min)/(max - min) × (new_max - new_min) + new_min. Example: values in the range 5-100 scaled to 0-1.

  • Z-score: Uses the mean and standard deviation, x' = (x - mean)/std; useful when min/max are unknown or when outliers would dominate min-max scaling.

  • Decimal Scaling: Divides each value by 10^j, where j is the smallest integer such that all |x'| < 1. Example: -500 → -0.5 (j = 3).
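All three techniques can be sketched in a few lines each (the input values below just reuse the examples from the table):

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Center on the mean and scale by the standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making all |v'| < 1."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(min_max([5, 100]))        # [0.0, 1.0]
print(decimal_scaling([-500]))  # [-0.5]
```

Note that z-score output has mean 0 and standard deviation 1 rather than a fixed range, which is why it copes better with outliers than min-max.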

8. Give an example of generalization.

Answer:
Age: 20-25 → Age1, 26-30 → Age2, etc.
Location: a specific street address generalizes to its city (e.g., LHR), which in turn generalizes to its province (Punjab), etc.
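The age generalization above can be sketched directly; the Age1/Age2 labels follow the example, and the age ranges are the illustrative ones given there:

```python
def generalize_age(age):
    """Replace an exact age with a higher-level concept label."""
    if 20 <= age <= 25:
        return "Age1"
    if 26 <= age <= 30:
        return "Age2"
    return "Other"

ages = [21, 27, 24, 33]
print([generalize_age(a) for a in ages])  # ['Age1', 'Age2', 'Age1', 'Other']
```

The exact ages are replaced by coarser concept labels, which reduces detail but makes patterns at the group level easier to mine.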
