COMP6714 2025T2 Project Specification

1. Project Overview

In this project, you will implement (using Python 3 on the CSE Linux machines) a simple search engine that ranks documents based on:

  • Query term coverage
  • Proximity of matched terms
  • Preservation of query term order

A search query consists of space-separated terms containing only alphanumeric characters (no punctuation).

2. Core Requirements

  • Implement an indexer (index.py) and a search program (search.py).
  • Use an inverted index with positional information (as described in Week 1 lectures).
  • Additional indexes may be implemented if needed.
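
As a rough illustration of the positional inverted index mentioned above, the sketch below maps each term to the documents it occurs in and the token positions within each document. The function name `build_positional_index` and the in-memory dictionary layout are assumptions made for this example; the specification does not prescribe a storage format.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}.

    `docs` maps a document ID to its list of already-normalized tokens
    (hypothetical input format, chosen for illustration only).
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term][doc_id].append(position)
    return index

# Tiny made-up collection, not taken from the project data
docs = {1: ["apple", "pie", "apple"], 2: ["pie", "chart"]}
index = build_positional_index(docs)
```

Keeping positions per document is what later makes proximity and order checks possible without re-reading the documents.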

3. Term Matching Rules

  • Case insensitive (e.g., “Apple” = “apple”).
  • Abbreviations: Ignore full stops (e.g., “U.S.” = “US”).
  • Hyphenated terms:
    • Preserve the hyphenated form if the first part has fewer than 3 letters (e.g., “D-Kans”, “co-author”).
    • Split otherwise (e.g., “set-aside” → “set”, “aside”).
  • Singular/plural and tense differences are ignored (e.g., “cat” = “cats”; “breach” = “breached”).
  • Sentence endings: Only ., ?, ! mark sentence boundaries.
  • Numbers:
    • Decimal numbers can be ignored, since “.” is invalid in search terms.
    • Years/integers should be indexed (commas ignored, e.g., “1,000,000” = “1000000”).
  • Other punctuation: Treated as token dividers.
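
The matching rules above could be sketched roughly as follows for a single whitespace-delimited chunk of text. This is a simplified illustration, not a reference implementation: the function name `normalize` is hypothetical, stemming or lemmatization for the singular/plural/tense rule is deliberately left out, and sentence-boundary handling is not covered.

```python
import re

def normalize(chunk):
    """Turn one whitespace-delimited chunk into a list of index terms."""
    chunk = chunk.lower()                               # case insensitive
    chunk = re.sub(r"(?<=\d),(?=\d)", "", chunk)        # "1,000,000" -> "1000000"
    if re.fullmatch(r"\d+\.\d+", chunk.strip(".,;:!?\"'()")):
        return []                                       # decimal numbers are ignored
    chunk = chunk.replace(".", "")                      # "u.s." -> "us"
    terms = []
    # Anything other than letters, digits and hyphens acts as a divider
    for piece in re.split(r"[^a-z0-9-]+", chunk):
        parts = [p for p in piece.split("-") if p]
        if len(parts) >= 2 and len(parts[0]) < 3:
            terms.append("-".join(parts))               # keep "d-kans", "co-author"
        else:
            terms.extend(parts)                         # "set-aside" -> "set", "aside"
    return terms
```

For instance, `normalize("set-aside")` yields `["set", "aside"]`, while `normalize("co-author")` keeps the hyphenated form.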

4. Ranking Criteria

Documents are ranked by:

  1. Term coverage (proportion of query terms matched).
  2. Proximity (average distance between matched terms).
  3. Order preservation (consecutive query terms appearing in the same left-to-right order).

Scoring Formula:
\[
\mathrm{Score}(d) = \alpha \cdot \frac{\#\,\mathrm{matched\_terms}}{\#\,\mathrm{query\_terms}} + \beta \cdot \frac{1}{1 + \mathrm{avg\_distance}} + \gamma \cdot \mathrm{ordered\_pairs}
\]
Where:

  • \(\alpha = 1.0\), \(\beta = 1.0\), \(\gamma = 0.1\) (default values).
  • For single-term queries, proximity and order scores are 0.
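
A minimal sketch of the scoring formula above, assuming the number of matched terms, the average distance, and the count of ordered pairs have already been computed elsewhere. The function and parameter names are chosen for this example only; how the average distance is defined when fewer than two terms match is not stated in this specification and is left to the caller.

```python
def score(matched_terms, query_terms, avg_distance, ordered_pairs,
          alpha=1.0, beta=1.0, gamma=0.1):
    """Combine coverage, proximity and order into a single document score."""
    coverage = matched_terms / query_terms
    if query_terms == 1:
        # Single-term queries: proximity and order scores are 0
        return alpha * coverage
    proximity = 1.0 / (1.0 + avg_distance)
    return alpha * coverage + beta * proximity + gamma * ordered_pairs
```

With the default weights, a two-term query whose terms both match at an average distance of 1 with one ordered pair scores 1.0 + 0.5 + 0.1 = 1.6.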

5. Indexer (index.py)

Command:

python3 index.py [folder-of-documents] [folder-of-indexes]

Output:

  • Total documents, tokens, and terms indexed.

Example:

$ python3 index.py ~cs6714/Public/data ./MyTestIndex  
Total number of documents: 1000  
Total number of tokens: 268,568  
Total number of terms: 259,182  
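
The summary lines in the example above could be produced along these lines. Note that the example prints the document count without a thousands separator but the token and term counts with one; the sketch simply mirrors that, and the helper name `format_stats` is hypothetical.

```python
def format_stats(num_docs, num_tokens, num_terms):
    """Return the three summary lines in the layout of the example output."""
    return [
        f"Total number of documents: {num_docs}",
        f"Total number of tokens: {num_tokens:,}",   # "," adds thousands separators
        f"Total number of terms: {num_terms:,}",
    ]

lines = format_stats(1000, 268568, 259182)
```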

6. Search Program (search.py)

Command:

python3 search.py [folder-of-indexes]

Behavior:

  • Accepts queries from stdin until Ctrl-D.
  • Outputs ranked document IDs (one per line).

Example:

$ python3 search.py ~/Proj/MyTestIndex  
Apple  
1361  
Australia Technology  
3454  
10  
18  
...  
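
The read-until-Ctrl-D behavior can be sketched as a loop over the input stream, since iterating over `sys.stdin` stops at end of file. The names `run_queries` and `fake_search` are placeholders for this example, the document IDs returned by `fake_search` are stand-ins, and skipping blank lines is an assumption that the specification does not confirm.

```python
import io
import sys

def run_queries(stream, search, out=sys.stdout):
    """Read one query per line until EOF and print ranked document IDs."""
    for line in stream:
        query = line.strip()
        if not query:
            continue  # assumption: blank lines are skipped
        for doc_id in search(query):
            print(doc_id, file=out)

# Demo with a stand-in search function (hypothetical results)
def fake_search(query):
    return [1361] if query == "Apple" else [3454, 10, 18]

demo_out = io.StringIO()
run_queries(io.StringIO("Apple\nAustralia Technology\n"), fake_search, demo_out)
```

In the real program the call would be something like `run_queries(sys.stdin, search)`, with `search` backed by the index loaded from the folder given on the command line.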

7. Displaying Matching Lines (Optional)

For queries starting with > :

  • Each document ID is printed with a > prefix, followed by the lines containing the closest matching terms.
  • Only one line per matched term (prioritizing earliest occurrence).

Example:

$ python3 search.py ~/Proj/MyTestIndex  
> Apples  
> 1361  
The department said stocks of fresh apples in cold storage  
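
One way to detect this optional mode is to check for the leading > before splitting the query into terms. The sketch below covers only that detection step; the helper name `parse_query` is hypothetical, and whether a space after > is required is an assumption left open here.

```python
def parse_query(line):
    """Return (show_lines, terms) for one raw query line."""
    line = line.strip()
    show_lines = line.startswith(">")
    if show_lines:
        line = line[1:]          # drop the ">" marker
    return show_lines, line.split()
```

For the example above, `parse_query("> Apples")` gives `(True, ["Apples"])`, after which the terms would go through the usual normalization.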

8. Marking (40 Points Total)

  • Correctness: Exact match of document IDs and order required for full marks.
  • Partial Marks: An F-measure (precision/recall) is used to award marks when the ranking contains errors.
  • Runtime Limits:
    • Indexer: 1 minute.
    • Search: 10 seconds per query.
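
For self-checking against expected results, an F-measure over returned and expected document IDs might be computed as below. This assumes the balanced F1 variant over sets of IDs; the exact variant and procedure used in marking are not stated in this specification, and the name `f_measure` is chosen for this example.

```python
def f_measure(returned, expected):
    """Balanced F1 over two collections of document IDs (order ignored)."""
    returned, expected = set(returned), set(expected)
    hits = len(returned & expected)
    if hits == 0:
        return 0.0
    precision = hits / len(returned)   # fraction of returned IDs that are correct
    recall = hits / len(expected)      # fraction of expected IDs that were returned
    return 2 * precision * recall / (precision + recall)
```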

9. Submission

  • Deadline: Friday, 1st August 23:59.
  • Format: All .py files in a single .zip archive, submitted via Moodle.
  • Late Penalty: 5% deduction per day (up to 5 days).

10. Permitted Libraries

  • Python Standard Library only.
  • NLTK allowed (pre-downloaded for marking; remove nltk.download() calls).

11. Plagiarism Policy

  • Individual work only.
  • Penalties apply for copying code or for making your code available in public repositories.