Ifeoma Grassl

metadata
title: Advanced Information Retrieval Test
emoji: 
colorFrom: green
colorTo: pink
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Advanced Information Retrieval

EU Data Protection Semantic Search + AI Summary

This web application allows users to search EU data protection regulations, specifically the GDPR (General Data Protection Regulation) and the LED (Law Enforcement Directive), using semantic search or keyword search and receive an AI-generated summary of the most relevant documents.

The app combines FAISS-based semantic retrieval, TF-IDF keyword search, and a Hugging Face language model for summarization.

The app is hosted on Hugging Face Spaces: https://huggingface.co/spaces/ifi-grassl/Advanced_Information_Retrieval_Test. You can use it directly in your browser; no installation is required.


Features

  • Semantic Search: Retrieve relevant documents using embeddings with a pre-trained sentence transformer.
  • Keyword Search: Retrieve documents using classical TF-IDF vectorization and cosine similarity.
  • AI Summary: Generate concise, human-readable summaries using a Hugging Face summarization model.
  • Top-K Results: Displays the top 5 most relevant documents for each query.

How to Run the System

  1. Open the link https://huggingface.co/spaces/ifi-grassl/Advanced_Information_Retrieval_Test
  2. Enter your query in the text box.
  3. Choose a retrieval method:
    • Semantic Search (embedding-based)
    • Keyword Search (TF-IDF)
  4. Click Submit
  5. View:
    • The raw retrieved documents
    • The AI-generated summary

You can clear the inputs using the Clear button.


Sample Queries

  • When can authorities transfer my data to another country?
  • How long is a company allowed to store my data?
  • Can data be used for automated profiling?
  • Who can access personal data during an investigation?
  • Is law enforcement allowed to forward my data to other countries?
  • GDPR Article 10
  • Lawfulness of processing

Models

  1. Embedding Model: sentence-transformers/all-MiniLM-L6-v2

    • Lightweight sentence transformer model that converts text into numerical embeddings.
    • Captures semantic meaning, allowing the app to find documents relevant to the query even if the exact words are not present.
    • Used for semantic search with FAISS.
  2. Summarization Model: pszemraj/led-large-book-summary

    • Transformer-based model fine-tuned for long-document summarization.
    • Takes the retrieved documents and generates a concise, human-readable summary that answers the user’s query.
    • Receives only the retrieved text as input, which keeps summaries focused on the provided documents and reduces the risk of hallucinated information.
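How embeddings support semantic search can be sketched without a model or an index library. The example below is a minimal pure-Python illustration, not the code in app.py. It uses hand-made 3-dimensional toy vectors in place of real all-MiniLM-L6-v2 embeddings (which have 384 dimensions) and a brute-force scan in place of FAISS. The function names `cosine` and `top_k` are hypothetical, and cosine similarity is an assumption, since the metric of the actual FAISS index is not stated here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=5):
    """Return (index, score) pairs for the k documents most similar to the query."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy stand-ins for document embeddings (invented values, for illustration only).
doc_vecs = [
    [0.9, 0.1, 0.0],  # imagine: a passage about data transfers
    [0.0, 1.0, 0.0],  # imagine: a passage about storage periods
    [0.7, 0.3, 0.1],  # imagine: another passage about transfers
]
query_vec = [1.0, 0.0, 0.0]  # imagine: a query about data transfers

results = top_k(query_vec, doc_vecs, k=2)
print([i for i, _ in results])
```

The two vectors that point in nearly the same direction as the query rank first, even though no words are compared. An index library such as FAISS performs this kind of nearest-neighbour lookup efficiently over the whole corpus.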

Retrieval Methods

The app offers two ways to find relevant documents:

  1. Semantic Search (Embedding-based)

    • Converts your query into a numerical vector using a pre-trained sentence embedding model (all-MiniLM-L6-v2).
    • Uses FAISS (Facebook AI Similarity Search), a library for fast similarity search, to efficiently search a precomputed embedding index of all documents.
    • The FAISS index stores each document as a vector, allowing the app to efficiently find the top 5 documents whose embeddings are closest to the query.
  2. Keyword Search (TF-IDF-based)

    • Converts both your query and all documents into TF-IDF vectors.
    • Uses a TF-IDF index and computes the cosine similarity between the query and each document. This is very fast here because the corpus is small.
    • Returns the top 5 documents with the highest similarity scores.
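The keyword search steps above can be sketched in pure Python. This is an illustrative from-scratch version under stated assumptions, not the code in app.py, which may well rely on a library vectorizer. The whitespace tokenizer, the smoothed IDF formula, the function names, and the three sample sentences are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; real tokenizers are more careful."""
    return text.lower().split()

def build_tfidf(docs):
    """Return (idf, vectors), where each vector maps term -> tf * idf."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so that no term gets a zero or negative weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in tokenized]
    return idf, vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, docs, k=5):
    """Return (index, score) pairs for the k best-matching documents."""
    idf, vectors = build_tfidf(docs)
    # Query terms that never occur in the corpus carry no weight.
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: c * idf[t] for t, c in q_counts.items()}
    scored = [(i, cosine(q_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Invented sample sentences, not taken from legal_corpus.jsonl.
docs = [
    "personal data shall be processed lawfully",
    "transfers of personal data to third countries",
    "the controller shall keep records of processing",
]
hits = search("transfers to third countries", docs, k=2)
print(hits[0][0])
```

Unlike the embedding-based method, this approach only rewards documents that share exact terms with the query, which is why a query such as "GDPR Article 10" tends to suit it.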

Repository Files

  • app.py – Main Gradio app for querying the documents and generating AI summaries.
  • embeddings.npy – Precomputed embeddings of all documents for semantic search.
  • extract_documents.py – Parses GDPR and LED PDF articles into legal_corpus.jsonl.
  • faiss_index.bin – FAISS index built from the precomputed embeddings for fast similarity search.
  • legal_corpus.jsonl – JSONL file containing all documents with a content field.
  • precompute_embeddings.py – Generates embeddings.npy and faiss_index.bin from the corpus.
  • README.md – Project documentation.
  • requirements.txt – Python dependencies required to run the app.
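As a rough illustration of the corpus format, the sketch below reads a JSONL file and collects the `content` field of each record. Only the `content` field name comes from the description above; the helper name `load_corpus`, the demo file name, and the two placeholder records are assumptions made for the example.

```python
import json
import os
import tempfile

def load_corpus(path):
    """Read a JSONL file and return the list of `content` strings."""
    documents = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            documents.append(record["content"])
    return documents

# Demo with a throwaway file; the two records are invented placeholders.
sample = [
    {"content": "Placeholder text for a first article."},
    {"content": "Placeholder text for a second article."},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_corpus.jsonl")
    with open(demo_path, "w", encoding="utf-8") as handle:
        for record in sample:
            handle.write(json.dumps(record) + "\n")
    corpus = load_corpus(demo_path)

print(len(corpus))
```

Each line of a JSONL file is an independent JSON object, so the corpus can be read one record at a time without loading a single large JSON document.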

General Notes

  • The app currently supports GDPR (General Data Protection Regulation) and LED (Law Enforcement Directive) legal documents.
  • GDPR is the EU regulation governing data protection and privacy for individuals within the European Union.
  • LED (Law Enforcement Directive) regulates the processing of personal data by EU law enforcement authorities for the prevention, investigation, detection, or prosecution of criminal offenses.
  • Summaries are based only on the retrieved documents; no external sources are passed to the summarization model.

Notes on Hugging Face Free Inference API

This project uses the free-tier Hugging Face Inference API to run pszemraj/led-large-book-summary.

The free tier is CPU-only and shared by all users. Because of this, you may occasionally encounter:

  • 504 Gateway Timeout
  • 503 Service Unavailable

These issues usually mean the backend is overloaded or the request took too long.

Tips

  • Wait a few minutes and try again.
  • Try a different query if yours retrieved very long documents.
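For anyone calling the summarization step from their own script, the "wait and try again" advice can be automated with a retry loop. The sketch below is a generic pattern under stated assumptions, not part of app.py: `TransientError`, `call_with_retry`, and `fake_summarize` are hypothetical names, and the fake function stands in for a real API call so the example runs offline.

```python
import time

class TransientError(Exception):
    """Hypothetical error carrying an HTTP status code."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retry(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a 503/504 TransientError, wait and retry.

    The delay doubles after every failed attempt. Other errors, or a
    failure on the final attempt, are re-raised to the caller.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except TransientError as err:
            if err.status not in (503, 504) or attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))

# Demo: a fake summarizer that fails twice before succeeding.
calls = {"n": 0}

def fake_summarize():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError(503)
    return "summary text"

# Record the delays instead of actually sleeping, so the demo runs instantly.
delays = []
result = call_with_retry(fake_summarize, sleep=delays.append)
print(result, delays)
```

With a shared free-tier backend, much longer delays than the ones shown here are likely to be needed. The doubling schedule simply avoids hammering an overloaded service.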