Natural_Language_Processing / pages /7_Advance_vectorization_techniques.py

Harika22

Update pages/7_Advance_vectorization_techniques.py

e1792cc verified about 1 year ago


	import streamlit as st

	st.markdown("""
	<style>
	/* Set a soft background color */
	body {
	background-color: #eef2f7;
	}
	/* Style for main title */
	h1 {
	color: black;
	font-family: 'Roboto', sans-serif;
	font-weight: 700;
	text-align: center;
	margin-bottom: 25px;
	}
	/* Style for headers */
	h2 {
	color: black;
	font-family: 'Roboto', sans-serif;
	font-weight: 600;
	margin-top: 30px;
	}

	/* Style for subheaders */
	h3 {
	color: red;
	font-family: 'Roboto', sans-serif;
	font-weight: 500;
	margin-top: 20px;
	}
	.custom-subheader {
	color: black;
	font-family: 'Roboto', sans-serif;
	font-weight: 600;
	margin-bottom: 15px;
	}
	/* Paragraph styling */
	p {
	font-family: 'Georgia', serif;
	line-height: 1.8;
	color: black;
	margin-bottom: 20px;
	}
	/* List styling with checkmark bullets */
	.icon-bullet {
	list-style-type: none;
	padding-left: 20px;
	}
	.icon-bullet li {
	font-family: 'Georgia', serif;
	font-size: 1.1em;
	margin-bottom: 10px;
	color: black;
	}
	.icon-bullet li::before {
	content: "◆";
	padding-right: 10px;
	color: black;
	}
	/* Sidebar styling */
	.sidebar .sidebar-content {
	background-color: #ffffff;
	border-radius: 10px;
	padding: 15px;
	}
	.sidebar h2 {
	color: #495057;
	}
	.step-box {
	font-size: 18px;
	background-color: #F0F8FF;
	padding: 15px;
	border-radius: 10px;
	box-shadow: 2px 2px 8px #D3D3D3;
	line-height: 1.6;
	}
	.box {
	font-size: 18px;
	background-color: #F0F8FF;
	padding: 15px;
	border-radius: 10px;
	box-shadow: 2px 2px 8px #D3D3D3;
	line-height: 1.6;
	}
	.title {
	font-size: 26px;
	font-weight: bold;
	color: #E63946;
	text-align: center;
	margin-bottom: 15px;
	}
	.formula {
	font-size: 20px;
	font-weight: bold;
	color: #2A9D8F;
	background-color: #F7F7F7;
	padding: 10px;
	border-radius: 5px;
	text-align: center;
	margin-top: 10px;
	}
	/* Custom button style */
	.streamlit-button {
	background-color: #00FFFF;
	color: #000000;
	font-weight: bold;
	}
	</style>
	""", unsafe_allow_html=True)

	st.header("Vectorization🧭")
	st.markdown(
	"""
	<div class='info-box'>
	<p>Vectorization is the process of converting text into numerical vectors.</p>
	<p>This allows ML models to process text data effectively.</p>
	</div>
	""",
	unsafe_allow_html=True
	)

	st.markdown("""
	The advanced vectorization techniques covered here are:
	<ul class="icon-bullet">
	<li>Word Embedding</li>
	<li>Word2Vec</li>
	<li>FastText</li>
	</ul>
	""", unsafe_allow_html=True)

	st.sidebar.title("Navigation 🧭")
	file_type = st.sidebar.radio(
	"Choose a vectorization technique:",
	("Word2Vec", "Fasttext"))

	st.header("Word Embedding Technique")
	st.markdown('''
	- Word embedding is an advanced vectorization technique that converts text into vectors while preserving semantic meaning.
	- Any technique that preserves semantic meaning while converting text into vectors is a word embedding technique.
	- Two word embedding techniques are covered here:
	    - Word2Vec
	    - FastText
	''')

	if file_type == "Word2Vec":
	st.title(":red[Word2Vec]")
	st.markdown(
	"""
	<h3 style='color: #6A0572;'>📌 How Does Word2Vec Work?</h3>
	<ul>
	<li>After <strong>training</strong>, we obtain the final <span class='highlight'>Word2Vec model</span></li>
	<li>The model stores a <strong>dictionary</strong> with word-vector pairs:</li>
	</ul>
	<pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
	{ w1: [v1], w2: [v2], w3: [v3] }
	</pre>
	""",
	unsafe_allow_html=True,
	)
	st.markdown(
	"""
	<h3 style='color: #6A0572;'>⚙️ Training vs. Test Time</h3>
	<ul>
	<li><strong>Training Time</strong>: <span class='highlight'>Corpus + Neural Network Algorithm</span> → Generates Model</li>
	<li><strong>Test Time</strong>: <span class='highlight'>Word</span> → Looked up in Dictionary → Returns <span class='highlight'>Vector Representation</span></li>
	</ul>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'>🔍 How Does It Preserve Meaning?</h3>
	<ul>
	<li>It learns from the <strong>context</strong> of words in the <span class='highlight'>corpus</span></li>
	<li>When given a word, it checks in the dictionary and retrieves the <strong>semantic vector</strong></li>
	<li>Unlike bag-of-words models, the <span class='highlight'>dimensions are not individual words</span> but learned features that capture aspects of meaning</li>
	</ul>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'>📚 Why is Corpus Important?</h3>
	<ul>
	<li>The <strong>Word2Vec algorithm</strong> is completely dependent on the corpus</li>
	<li>Better corpus → Better word representation</li>
	<li>It <strong>preserves semantic meaning</strong> using neighborhood words (context)</li>
	</ul>
	""",
	unsafe_allow_html=True,
	)
	st.markdown('''
	- Word2Vec converts individual words into vectors, not whole documents.
	- Two techniques can be used to convert an entire document into a vector:
	    - Average Word2Vec
	    - TF-IDF Word2Vec
	''')

	st.subheader(":blue[Average Word2Vec]")
	st.markdown(
	"""
	<h3 style='color: #6A0572;'>📌 Step-by-Step Process</h3>
	<ul>
	<li>Given a document <span class='highlight'>d1</span>: <strong>w1, w2, w3</strong></li>
	<li>Retrieve vector representations <strong>v1, v2, v3</strong> from Word2Vec</li>
	<li>Perform <span class='highlight'>element-wise addition</span> of vectors:
	<pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
	v_total = v1 + v2 + v3
	</pre>
	</li>
	<li>Normalize by dividing by the total number of words (element-wise division):
	<pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
	v_avg = v_total / len(d1)
	</pre>
	</li>
	<li>Final representation contains the <span class='highlight'>average meaning</span> of all words</li>
	</ul>
	""",
	unsafe_allow_html=True,
	)
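	st.markdown('''
A minimal sketch of the averaging steps above, written with plain Python lists. The dictionary `word_vectors` and its numbers are made up for illustration, and skipping words that are missing from the dictionary is an assumption, not part of the steps listed above.

```python
# Toy word-vector dictionary; the words and numbers are made up for illustration.
word_vectors = {
    "w1": [1.0, 2.0],
    "w2": [3.0, 4.0],
    "w3": [5.0, 6.0],
}

def average_word2vec(document, vectors):
    # Look up each word's vector (assumption: unknown words are skipped).
    found = [vectors[word] for word in document if word in vectors]
    if not found:
        return []
    # Element-wise addition: v_total = v1 + v2 + v3
    v_total = [sum(values) for values in zip(*found)]
    # Element-wise division by the number of words that were found
    return [value / len(found) for value in v_total]

d1 = ["w1", "w2", "w3"]
v_avg = average_word2vec(d1, word_vectors)
print(v_avg)  # [3.0, 4.0]
```
	''')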

	st.markdown(
	"""
	<h3 style='color: #6A0572;'>⚠️ Problem: Equal Importance to Every Word</h3>
	<ul>
	<li>Average Word2Vec assigns <span class='highlight'>equal weight</span> to all words</li>
	<li>No emphasis on <strong>important words</strong> that carry significant meaning</li>
	<li>This limits the effectiveness in understanding <span class='highlight'>word importance</span></li>
	</ul>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<strong>Average Word2Vec averages word meanings, but gives no extra weight to important words!</strong>
	""",
	unsafe_allow_html=True,
	)

	st.subheader(":blue[TF-IDF Word2Vec]")
	st.markdown(
	"""
	<h3 style='color: #6A0572;'>⚠️ Issue with Average Word2Vec</h3>
	<ul>
	<li>Gives equal importance to every word</li>
	<li>Words that appear frequently in a document but rarely in the corpus, which are usually the most informative, get no extra weight</li>
	</ul>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'>🚀 Solution: Adding Weightage</h3>
	<ul>
	<li>Consider a document with 3 words: <strong>w1, w2, w3</strong></li>
	<li>Each word has a vector representation:
	<pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
	w1 → v1, w2 → v2, w3 → v3
	</pre>
	</li>
	<li>We use <span class='highlight'>two models</span>:
	<ul>
	<li><strong>TF-IDF</strong> → Computes weightage for each word</li>
	<li><strong>Word2Vec</strong> → Converts words into vectors</li>
	</ul>
	</li>
	<li>For each word, multiply its TF-IDF value with its vector</li>
	</ul>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<strong>Final Weighted Representation:</strong>
	<pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
	v_final = (TF-IDF(w1) * v1 + TF-IDF(w2) * v2 + TF-IDF(w3) * v3)
	/ (TF-IDF(w1) + TF-IDF(w2) + TF-IDF(w3))
	</pre>
	""",
	unsafe_allow_html=True,
	)
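	st.markdown('''
The weighted formula above can be sketched as follows. This is an illustration under stated assumptions: the TF-IDF weights in `tfidf` are hand-picked numbers rather than the output of a real TF-IDF model, and the helper name `tfidf_word2vec` is hypothetical.

```python
# Toy inputs; both dictionaries hold made-up values for illustration.
word_vectors = {"w1": [1.0, 0.0], "w2": [0.0, 1.0], "w3": [1.0, 1.0]}
tfidf = {"w1": 0.5, "w2": 0.25, "w3": 0.25}

def tfidf_word2vec(document, vectors, weights):
    dim = len(next(iter(vectors.values())))
    weighted_sum = [0.0] * dim
    weight_total = 0.0
    for word in document:
        # Assumption: words without a vector or a weight are skipped.
        if word not in vectors or word not in weights:
            continue
        w = weights[word]
        for i, value in enumerate(vectors[word]):
            weighted_sum[i] += w * value  # TF-IDF(w) * v
        weight_total += w
    if weight_total == 0:
        return weighted_sum
    # Divide by the sum of the TF-IDF weights
    return [value / weight_total for value in weighted_sum]

v_final = tfidf_word2vec(["w1", "w2", "w3"], word_vectors, tfidf)
print(v_final)  # [0.75, 0.5]
```

Because `w1` carries the largest weight, the result is pulled toward its vector instead of sitting at the plain average.
	''')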
	st.subheader("How to train our own W2V model")
	st.markdown('''
	- At training time, the Word2Vec algorithm can be applied to the corpus using one of two techniques:
	    - Skip-gram
	    - CBOW
	''')

	st.subheader(":red[CBOW]")
	st.markdown(
	"""
	<div class='box'>
	<h3 style='color: #6A0572;'>What is CBOW?</h3>
	<p><strong>CBOW (Continuous Bag of Words)</strong> is a technique where we use surrounding words (context) to predict the target word (focus word).</p>
	</div>
	""",
	unsafe_allow_html=True,
	)
	st.markdown(
	"""
	<h3 style='color: #6A0572;'>📂 Example Corpus</h3>
	<ul>
	<li><strong>d1:</strong> w1, w2, w3, w4, w5, w4</li>
	<li><strong>d2:</strong> w3, w4, w5, w2, w1, w2, w3, w4</li>
	</ul>
	<p>We first preprocess the data to extract meaningful relationships.</p>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'>📌 Steps to Process the Data</h3>
	<ul>
	<li>Create a <span class='highlight'>vocabulary</span> from the entire corpus: <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">{w1, w2, w3, w4, w5}</pre></li>
	<li>Generate a <strong>tabular dataset</strong> with:
	<ul>
	<li><strong>Feature variables (Context Words)</strong></li>
	<li><strong>Class variables (Target Words)</strong></li>
	</ul>
	</li>
	<li>Apply a <span class='highlight'>window size</span> of 2 (how many neighbors we consider).</li>
	<li>Slide the window over the text with <span class='highlight'>slide = 1</span>.</li>
	</ul>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'> Handling Variable Context Length</h3>
	<ul>
	<li>To ensure a consistent feature length, we use <strong>zero-padding</strong> when needed.</li>
	<li>The model tries to understand relationships based on the surrounding <span class='highlight'>context words</span>.</li>
	</ul>
	""",
	unsafe_allow_html=True,
	)
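	st.markdown('''
A minimal sketch of how the tabular CBOW dataset could be built from `d1`. It assumes that a window size of 2 means two neighbors on each side of the focus word, and it uses a hypothetical `<pad>` token to stand in for zero-padding at the document edges.

```python
PAD = "<pad>"  # hypothetical padding token standing in for a zero vector

def cbow_rows(document, window=2):
    rows = []
    # Slide over the document one word at a time (slide = 1).
    for i, target in enumerate(document):
        left = document[max(0, i - window):i]
        right = document[i + 1:i + 1 + window]
        # Pad both sides so every row has exactly 2 * window context words.
        left = [PAD] * (window - len(left)) + left
        right = right + [PAD] * (window - len(right))
        rows.append((left + right, target))
    return rows

d1 = ["w1", "w2", "w3", "w4", "w5", "w4"]
rows = cbow_rows(d1)
for context, target in rows:
    print(context, "->", target)
```

Each row pairs the context words (features) with the focus word (class label).
	''')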
	st.markdown(
	"""
	<strong>Mathematical Representation:</strong>
	<pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
	y = f(xi)
	where,
	y = Focus Word (Target)
	xi = Context Words (Neighbors)
	</pre>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'> Training with Artificial Neural Networks</h3>
	<p>The tabular data is passed to an <strong>Artificial Neural Network (ANN)</strong> which learns:</p>
	<ul>
	<li>How <span class='highlight'>context words</span> are related to <span class='highlight'>focus words</span>.</li>
	</ul>
	""",
	unsafe_allow_html=True,
	)

	st.subheader(":red[Skip-gram]")
	st.markdown(
	"""
	<div class='box'>
	<h3 style='color: #6A0572;'>What is Skip-gram?</h3>
	<p><strong>Skip-gram</strong> is a technique where we use the focus word to predict its context words.</p>
	</div>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'>📂 Example Corpus</h3>
	<ul>
	<li><strong>d1:</strong> w1, w2, w3, w4, w5, w4</li>
	<li><strong>d2:</strong> w3, w4, w5, w2, w1, w2, w3, w4</li>
	</ul>
	<p>We first preprocess the data to extract meaningful relationships.</p>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'>📌 Steps to Process the Data</h3>
	<ul>
	<li>Create a <span class='highlight'>vocabulary</span> from the entire corpus: <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">{w1, w2, w3, w4, w5}</pre></li>
	<li>Generate a <strong>tabular dataset</strong> with:
	<ul>
	<li><strong>Feature variables (Focus Words)</strong></li>
	<li><strong>Class variables (Context Words)</strong></li>
	</ul>
	</li>
	<li>Apply a <span class='highlight'>window size</span> of 2 (how many neighbors we consider).</li>
	<li>Slide the window over the text with <span class='highlight'>slide = 1</span>.</li>
	</ul>
	""",
	unsafe_allow_html=True,
	)
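	st.markdown('''
A minimal sketch of the skip-gram dataset for `d1`. It assumes a window of two neighbors on each side and emits one row per (focus word, context word) pair, which is one common way to lay out the table; the helper name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(document, window=2):
    pairs = []
    # Slide over the document one word at a time (slide = 1).
    for i, focus in enumerate(document):
        start = max(0, i - window)
        end = min(len(document), i + window + 1)
        for j in range(start, end):
            if j != i:
                # Feature = focus word, class = one of its context words
                pairs.append((focus, document[j]))
    return pairs

d1 = ["w1", "w2", "w3", "w4", "w5", "w4"]
pairs = skipgram_pairs(d1)
print(pairs[:5])
# [('w1', 'w2'), ('w1', 'w3'), ('w2', 'w1'), ('w2', 'w3'), ('w2', 'w4')]
```
	''')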

	st.markdown(
	"""
	<h3 style='color: #6A0572;'> Handling Variable Context Length</h3>
	<ul>
	<li>To ensure a consistent feature length, we use <strong>zero-padding</strong> when needed.</li>
	<li>The model tries to understand relationships based on the <span class='highlight'>focus words</span>.</li>
	</ul>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<strong>Mathematical Representation:</strong>
	<pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
	y = f(xi)
	where,
	y = Context Words
	xi = Focus Word
	</pre>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'> Training with Artificial Neural Networks</h3>
	<p>The tabular data is passed to an <strong>Artificial Neural Network (ANN)</strong> which learns:</p>
	<ul>
	<li>How <span class='highlight'>focus words</span> are related to <span class='highlight'>context words</span>.</li>
	</ul>
	""",
	unsafe_allow_html=True,
	)


	elif file_type == "Fasttext":
	st.title(":red[FastText]")
	st.markdown(
	"""
	<p><strong>FastText</strong> is an advanced word vectorization technique that enhances word embeddings by considering subword information.</p>
	<p>It is a <span class='highlight'>simple extension</span> of Word2Vec, which converts words into vectors.</p>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'> Implementing FastText</h3>
	<p>FastText can be implemented using:</p>
	<ul>
	<li><strong>CBOW (Continuous Bag of Words)</strong></li>
	<li><strong>Skip-gram</strong></li>
	</ul>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<strong>CBOW Representation:</strong>
	<pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
	y = f(xi)
	where,
	y = Focus Word
	xi = Context Words
	</pre>

	<strong>Skip-gram Representation:</strong>
	<pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
	y = f(xi)
	where,
	y = Context Words
	xi = Focus Word
	</pre>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'> Problem: Out-of-Vocabulary (OOV)</h3>
	<p>Word-level embedding techniques such as Word2Vec cannot return a vector for a new or rare word that is missing from the training vocabulary.</p>
	<p><span class='highlight'>FastText overcomes this issue</span> by breaking words into subword units (character n-grams).</p>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'>Implementing CBOW with Character N-Grams</h3>
	<ul>
	<li><span class='highlight'>Window Size</span>: 5 (the focus word plus its neighbors)</li>
	<li><span class='highlight'>Neighbors per side</span>: 2</li>
	<li><span class='highlight'>Slide</span>: 1</li>
	</ul>
	<p>A tabular format is created with <strong>context words</strong> and <strong>focus words</strong>.</p>
	""",
	unsafe_allow_html=True,
	)
	st.markdown(
	"""
	## Example Sentences:
	- d1: "apple is good for health"
	- d2: "biryani is not good for health"

	This application creates a table for context words and focus words using character 2-grams.
	"""
	)

	st.markdown('''
	- Character 2-gram table:
	    - Context words ("apple", "is"): ["ap", "pp", "pl", "le", "is"]
	    - Focus word ("good"): ["go", "oo", "od"]
	''')
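	st.markdown('''
The 2-gram table above can be reproduced with a short sketch. The helper name `char_ngrams` is hypothetical. It splits on plain characters only, whereas the fastText library also adds word-boundary markers and uses longer n-grams by default.

```python
def char_ngrams(word, n=2):
    # Slide a window of n characters over the word, one character at a time.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

context_words = ["apple", "is"]
focus_word = "good"

context_ngrams = [gram for word in context_words for gram in char_ngrams(word)]
focus_ngrams = char_ngrams(focus_word)

print(context_ngrams)  # ['ap', 'pp', 'pl', 'le', 'is']
print(focus_ngrams)    # ['go', 'oo', 'od']
```
	''')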

	st.markdown(
	"""
	- Each word's vector is obtained by averaging the vectors of its character 2-grams.
	"""
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'>Vocabulary</h3>
	<p>The vocabulary consists of <span class='highlight'>unique character n-grams</span>.</p>
	<pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
	{ keys: values }
	where,
	- Keys: Character n-grams
	- Values: Vector representations
	</pre>
	""",
	unsafe_allow_html=True,
	)

	st.markdown(
	"""
	<h3 style='color: #6A0572;'> FastText Model</h3>
	<ul>
	<li>The dictionary created is the <span class='highlight'>FastText model</span>.</li>
	<li>Text is broken down into <strong>character n-grams</strong> to generate vector representations.</li>
	<li>The n-gram vectors are combined by <span class='highlight'>element-wise addition</span> and then averaged, giving an <strong>average representation</strong> of the word.</li>
	</ul>
	""",
	unsafe_allow_html=True,
	)
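	st.markdown('''
A minimal sketch of how a dictionary of n-gram vectors can produce a vector for a word that was never seen as a whole. The vectors in `ngram_vectors` are made up, the helper names are hypothetical, and treating "food" as the unseen word is only for illustration.

```python
# Toy FastText-style model: character 2-grams mapped to made-up vectors.
ngram_vectors = {
    "go": [1.0, 0.0],
    "oo": [0.0, 1.0],
    "od": [2.0, 2.0],
    "fo": [4.0, 3.0],
}

def char_ngrams(word, n=2):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def fasttext_vector(word, vectors, n=2):
    # Keep only the n-grams the model knows about.
    found = [vectors[gram] for gram in char_ngrams(word, n) if gram in vectors]
    if not found:
        return []
    # Element-wise addition followed by averaging
    total = [sum(values) for values in zip(*found)]
    return [value / len(found) for value in total]

good = fasttext_vector("good", ngram_vectors)  # uses go, oo, od
food = fasttext_vector("food", ngram_vectors)  # uses fo, oo, od
print(good)  # [1.0, 1.0]
print(food)  # [2.0, 2.0]
```

Because "food" shares the 2-grams "oo" and "od" with "good", it still gets a vector even though the dictionary has no entry for the whole word.
	''')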