
Analyzes customer feedback and sentiments to help businesses improve their products and services
Applications include:
- Sentiment Analysis: understanding customer opinions and emotions in reviews and social media
- Topic Modeling: identifying key themes and topics within large text corpora
- Spam Detection: filtering out unwanted emails and messages
- Information Retrieval: improving search engines by providing more relevant results
- Healthcare: analyzing medical records and research papers to support clinical decisions

Stemming:
Definition: reducing a word to its base or root form, typically by removing suffixes
Purpose: ensuring that different variants of a word (e.g., "connected" and "connecting") are treated as the same term in text analysis
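
The suffix-removal idea above can be sketched in a few lines. This is a deliberately naive illustration, not the Porter algorithm or any library's stemmer; the suffix list and the `min_stem_len` parameter are assumptions made for the example.

```python
# Naive suffix-stripping stemmer: an illustrative sketch only.
# SUFFIXES is a hypothetical rule list, ordered longest first.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


variants = ["connect", "connected", "connecting", "connects"]
stems = {naive_stem(w) for w in variants}
print(stems)  # the four variants collapse to one stem
```

A production stemmer applies many more rules (and exceptions) than this; the point here is only that the variants end up as a single term.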

Benefits of feature selection include:
- Improved Accuracy: selecting the most informative features focuses the model on the relevant aspects of the text
- Interpretability: models with fewer features are easier to interpret and understand, which is crucial in many applications
- Impact on Model Performance: fewer features mean faster training and prediction, lower risk of overfitting, and simpler models
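
One way to pick "the most informative features" is to score each term against the class labels and keep the top k. The sketch below uses the chi-square statistic on a 2x2 term/class table; the choice of chi-square, the function names, and the toy documents are assumptions for illustration, not something the notes prescribe.

```python
def chi_square_scores(docs, labels):
    """Score each term by the chi-square statistic of its 2x2 term/class table.

    docs: list of whitespace-tokenizable strings; labels: list of 0/1 class labels.
    """
    n = len(docs)
    positive = sum(labels)  # number of documents with label 1
    doc_sets = [set(d.split()) for d in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        a = sum(1 for s, y in zip(doc_sets, labels) if term in s and y == 1)  # present, class 1
        b = sum(1 for s, y in zip(doc_sets, labels) if term in s and y == 0)  # present, class 0
        c = positive - a                # absent, class 1
        d = (n - positive) - b          # absent, class 0
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy labeled corpus (1 = spam, 0 = not spam), invented for this example.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes for review",
]
labels = [1, 1, 0, 0]
top_terms = select_top_k(docs, labels, 5)
print(top_terms)
```

Terms that occur in only one class score highest, so a model trained on `top_terms` sees far fewer features than the full vocabulary.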

Decision Tree Classifiers:
- Principle: splitting the data into subsets based on the value of input features, creating a tree-like model of decisions
- Advantages: easy to understand and visualize, no assumptions about the distribution of data, and can naturally rank the importance of features
- Disadvantages: prone to overfitting, especially with complex trees, and can be biased towards features with more levels
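
The splitting principle can be illustrated with a single level of a tree: try each feature, split the data on it, and keep the split whose subsets are purest. The sketch below assumes binary (0/1) features and uses Gini impurity as the purity measure; both are choices made for this example, and the feature meanings are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_binary_split(rows, labels):
    """Return (feature_index, weighted_gini) for the 0/1 feature that splits best.

    Returns (None, parent_gini) if no feature improves on the unsplit data.
    """
    n = len(rows)
    best = (None, gini(labels))
    for f in range(len(rows[0])):
        left = [y for x, y in zip(rows, labels) if x[f] == 0]
        right = [y for x, y in zip(rows, labels) if x[f] == 1]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (f, weighted)
    return best


# Hypothetical features: [contains "free", contains "meeting"]
rows = [[1, 0], [1, 0], [0, 1], [0, 0]]
labels = ["spam", "spam", "ham", "ham"]
split = best_binary_split(rows, labels)
print(split)  # feature 0 separates the two classes completely
```

A full decision tree repeats this choice recursively on each subset, which is also where the overfitting risk comes from if the recursion is never stopped.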
Proximity-based Classifiers (e.g., k-NN):
- Principle: classifying documents based on their proximity to other documents in the feature space, usually using distance metrics
- Advantages: simple and intuitive, no explicit training phase (the training documents are simply stored), and adaptable to new data
- Disadvantages: high computational cost during prediction, especially with large datasets
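
A minimal k-NN sketch shows both the principle and the cost: every prediction compares the query against all stored documents. Cosine similarity, the term-count vectors, and the labels below are assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_classify(query, train_vectors, train_labels, k=3):
    """Label the query by majority vote among its k most similar training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine_similarity(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy term-count vectors (hypothetical vocabulary of three terms).
train_vectors = [[3, 0, 1], [2, 0, 0], [0, 3, 1], [0, 2, 2]]
train_labels = ["sports", "sports", "politics", "politics"]

print(knn_classify([1, 0, 0], train_vectors, train_labels))
print(knn_classify([0, 1, 1], train_vectors, train_labels))
```

Because `knn_classify` sorts the whole training set for each query, prediction time grows with the number of stored documents, which is the disadvantage noted above.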

Metasearch:
Definition: aggregating results from multiple search engines into a single unified list of results
Techniques for combining rank positions:
- Combining Algorithms:
  - Simple Aggregation: combining ranks from different search engines by averaging or summing their positions
  - Weighted Aggregation: assigning weights to different search engines based on their perceived relevance or performance
  - Borda Count: a rank aggregation method where each position is assigned points, and documents are ranked based on total points
- Rank Fusion:
  - Round-Robin: selecting results in a round-robin fashion from different search engines
  - Condorcet Fusion: using a voting-based method where each pair of results is compared, and the one preferred by the majority is ranked higher
- Machine Learning:
  - Learning to Rank: training a machine learning model using features from different search engines to predict the best rank for a document
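
Two of the techniques above, Borda count and round-robin, are simple enough to sketch directly. The point scheme (a document at position i in a list of n results earns n - i points, and documents missing from a list earn nothing from it) is one common convention and is an assumption here; the engine result lists are invented.

```python
from collections import defaultdict


def borda_count(rankings):
    """Fuse ranked lists by total Borda points (ties broken by document id)."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, doc in enumerate(ranking):
            scores[doc] += n - position  # top result earns n points, last earns 1
    return sorted(scores, key=lambda d: (-scores[d], d))


def round_robin(rankings):
    """Take one result from each engine in turn, skipping documents already chosen."""
    merged, seen = [], set()
    for depth in range(max(len(r) for r in rankings)):
        for ranking in rankings:
            if depth < len(ranking) and ranking[depth] not in seen:
                seen.add(ranking[depth])
                merged.append(ranking[depth])
    return merged


# Hypothetical result lists from three search engines.
engine_a = ["d1", "d2", "d3"]
engine_b = ["d2", "d1", "d4"]
engine_c = ["d2", "d3", "d1"]

borda = borda_count([engine_a, engine_b, engine_c])
interleaved = round_robin([engine_a, engine_b, engine_c])
print(borda)
print(interleaved)
```

Borda count rewards documents that several engines rank highly, while round-robin only preserves each engine's order and is sensitive to which engine is consulted first.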

Content Spamming:
- Keyword Stuffing: overloading a webpage with keywords to manipulate search engine rankings
- Cloaking: serving different content to search engines than what is visible to users to deceive search algorithms
- Hidden Text: using invisible text to stuff keywords without affecting the page's appearance to users
Link Spamming:
- Link Farms: creating a network of interlinked websites to artificially boost the link popularity of each site
- Paid Links: buying or selling links to manipulate PageRank or search rankings
Comment Spam: posting irrelevant or low-quality comments on blogs and forums with links back to the spammer's site
Redirection:
- Sneaky Redirects: automatically redirecting users to a different page than what was indexed by the search engine
- Doorway Pages: creating multiple pages that lead to the same destination to rank for various search queries
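
As a rough illustration of how keyword stuffing might be flagged, the sketch below measures how much of a page's text is taken up by its single most frequent term. The 0.3 threshold and the function names are assumptions for the example; real text would need stop-word handling, and a heuristic this simple is easy for a spammer to evade.

```python
import re
from collections import Counter


def keyword_density(text):
    """Return (most_frequent_term, share_of_all_tokens) for the given text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return ("", 0.0)
    term, count = Counter(tokens).most_common(1)[0]
    return (term, count / len(tokens))


def looks_stuffed(text, threshold=0.3):
    """Flag text whose top term exceeds the (hypothetical) density threshold."""
    _, density = keyword_density(text)
    return density > threshold


stuffed = "cheap flights cheap flights book cheap flights cheap cheap deals"
normal = "we compare prices from several airlines to help you find a flight"

print(keyword_density(stuffed))
print(looks_stuffed(stuffed), looks_stuffed(normal))
```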

Challenges in Combating Web Spam:
- Detection Complexity: sophisticated spammers can create content that appears legitimate, making it hard to distinguish from genuine content

Sentiment Analysis Pipeline:
- Text Preprocessing: cleaning the text by removing noise, handling missing data, and normalizing text
- Feature Extraction: using techniques like tokenization, TF-IDF, word embeddings, and sentence embeddings to capture semantic meaning
- Supervised Learning: training models on data with known sentiment labels, and evaluating them using cross-validation and metrics like accuracy, precision, recall, and F1-score
- Unsupervised Learning: applying clustering techniques to group similar documents and infer sentiment based on cluster characteristics
- Hybrid Approach: combining supervised and unsupervised methods to leverage the strengths of both
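
The TF-IDF feature extraction mentioned above can be sketched without any library. This uses one common variant (term frequency normalized by document length, and an unsmoothed logarithmic inverse document frequency); other weightings exist, and the whitespace tokenizer and sample reviews are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


# Invented review snippets.
reviews = ["great product great price", "terrible product", "great service"]
vectors = tf_idf(reviews)
for review, vector in zip(reviews, vectors):
    print(review, vector)
```

Terms that appear in every document get a weight of zero, while terms concentrated in a few documents are weighted up; the resulting vectors are what a supervised or clustering step would consume.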
