Spaces:

Harika22
/

Natural_Language_Processing

Sleeping

App Files Files Community

Natural_Language_Processing / pages /5_Pre-processing_of_text.py

Harika22

Rename pages/5_Pre-procesing_of_text.py to pages/5_Pre-processing_of_text.py

f5b1f4a verified about 1 year ago

raw

history blame contribute delete

12.5 kB

	import streamlit as st

	st.markdown(
	"""
	<style>
	body {
	background-color: #f9f9f9; /* Light gray background */
	font-family: 'Arial', sans-serif;
	}
	@keyframes fadeIn {
	0% { opacity: 0; transform: translateY(-20px); }
	100% { opacity: 1; transform: translateY(0); }
	}
	.title {
	text-align: center;
	color: #2c3e50; /* Deep gray-blue */
	font-size: 3rem;
	font-weight: bold;
	animation: fadeIn 1s ease-in-out;
	}
	.caption {
	text-align: center;
	font-style: italic;
	font-size: 1.2rem;
	color: #7f8c8d; /* Soft gray */
	animation: fadeIn 1.5s ease-in-out;
	}
	.section {
	font-size: 1.1rem;
	text-align: justify;
	line-height: 1.8;
	color: #34495e; /* Muted gray */
	background: #ffffff; /* White card-style background */
	padding: 20px;
	border-radius: 10px;
	box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
	animation: fadeIn 2s ease-in-out;
	margin: 10px 0;
	}
	.image-container {
	text-align: center;
	margin: 20px 0;
	animation: fadeIn 2.5s ease-in-out;
	}
	.image-container img {
	border-radius: 15px;
	box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);
	transition: transform 0.3s ease-in-out;
	}
	.image-container img:hover {
	transform: scale(1.05); /* Subtle zoom effect */
	}
	</style>
	""",
	unsafe_allow_html=True,
	)
	st.header(":blue[✨ Pre-processing of Text 🗺️]")

	st.markdown("<div class='section'>", unsafe_allow_html=True)
	st.markdown("<h2 class='title'>🔍 Transforming Raw Text</h2>", unsafe_allow_html=True)
	st.markdown("<p class='subtitle'>Convert unstructured text into a clean and structured format</p>", unsafe_allow_html=True)

	st.info("📌 We preprocess text in three key ways:\n\n✅ Cleaning - Problem-specific\n\n✅ Simple Pre-processing\n\n✅ Advanced Pre-processing")

	st.markdown("</div>", unsafe_allow_html=True)


	st.markdown("### ✨ Essential Preprocessing Techniques:")

	st.markdown("✅ Convert Text Case – Convert all words to uppercase or lowercase to maintain consistency and reduce dimensions.")
	st.markdown("✅ Handle URLs and Tags – Based on problem statement, either remove or preserve them.")
	st.markdown("✅ Mentions, Digits, Emails – Generally removed unless required by the analysis.")
	st.markdown("✅ Preserve Emojis – Emojis carry sentiment and play a crucial role in NLP tasks.")
	st.markdown("✅ Grammar Preservation – If grammar is needed, avoid removing punctuation.")

	st.success("🚀 Well-structured and clean text significantly boosts ML model performance!")


	st.markdown("<div class='section'>", unsafe_allow_html=True)
	st.markdown("<h2 class='title'>🔍 NLP Data Preprocessing</h2>", unsafe_allow_html=True)
	st.markdown("<p class='subtitle'>Transforming raw text into structured data for better ML performance</p>", unsafe_allow_html=True)


	st.success("📌 Benefits of Preprocessing:\n\n✅ Reduces dimensionality\n\n✅ Improves ML performance\n\n✅ Converts raw text into problem-specific structured data")

	st.markdown("### ✨ Essential Preprocessing Steps:")

	st.markdown(
	"""
	<div class='image-container'>
	<img src="https://cdn-uploads.huggingface.co/production/uploads/66bde9bf3c885d04498227a0/HtdtNm-UJdfN057BeKSgV.png",width=400>
	</div>
	""",
	unsafe_allow_html=True,
	)


	st.markdown("✅ Converting Text Case – Reduces dimensionality; case conversion depends on problem statement.")
	st.markdown("✅ Removing URLs, Tags, and Mentions – Retain only if required by the problem statement.")
	st.markdown("✅ Handling Emojis – Preserve or convert emoji data based on context.")
	st.markdown("✅ Expanding Contractions & Acronyms – Convert abbreviations into standard text.")
	st.markdown("✅ Stop Words Removal – Optional, useful for text simplification.")
	st.markdown("✅ Stemming & Lemmatization – Perform only if grammar is not crucial for analysis.")

	st.markdown("</div>", unsafe_allow_html=True)

	st.markdown("<h1 class='header-title'>🔍 Stemming & Lemmatization 💬</h1>", unsafe_allow_html=True)

	st.markdown(
	"""
	<div class='info-box'>
	<p>📝 In English, words are often made up of three components:</p>
	<ul>
	<li>🔹 <span class='highlight'>Prefix</span> + <span class='highlight'>Word</span> + <span class='highlight'>Suffix</span></li>
	</ul>
	<p>✅ Words without a suffix are called <span class='highlight'>Root Words</span>.</p>
	<p>✅ If a suffix is added to a root word, the resulting word is an <span class='highlight'>Inflected Word</span>:</p>
	<ul>
	<li>🛠️ <span class='highlight'>Root Word</span> + <span class='highlight'>Suffix</span> = Inflected Word</li>
	</ul>
	<p>💬 The process of removing the suffix from inflected words to get the root word is known as:</p>
	<ul>
	<li>✂️ <span class='highlight'>Stemming</span></li>
	<li>🧠 <span class='highlight'>Lemmatization</span></li>
	</ul>
	</div>
	""",
	unsafe_allow_html=True
	)

	st.markdown("<h1 class='header-title'>🌿 Stemming 🔎</h1>", unsafe_allow_html=True)


	st.markdown(
	"""
	<div class='info-box'>
	<p>📝 <span class='highlight'>Stemming</span> is the process of reducing an inflected word to its root form, known as the <span class='highlight'>stem</span>.</p>
	<ul>
	<li>🔹 <span class='highlight'>Inflected word ➝ Root word (Stem)</span></li>
	<li>⚡ The stem may not always be a valid English word.</li>
	<li>🚀 <span class='highlight'>Performance is faster</span> compared to lemmatization.</li>
	<li>⚡ It is used only for Removal.</li>
	<li>🔹 Whenever we need Retrieval system we use stemming</li>
	</ul>
	</div>
	""",
	unsafe_allow_html=True
	)

	st.markdown("<h2 class='sub-header'>📌 Types of Stemming</h2>", unsafe_allow_html=True)
	st.markdown("""
	- There are three major types of stemming techniques:
	- 🔹 Porter Stemmer 🏛️ (Rule-based, works in 5 stages)
	- 🔹 Snowball Stemmer ❄️ (Rule-base, Language adaptable)
	- 🔹 Lancaster Stemmer 🔁 (Iterative, aggressive removal)
	""")

	st.markdown("<h2 class='sub-header'>🏛️ Porter Stemmer</h2>", unsafe_allow_html=True)
	st.markdown(
	"""
	<div class='info-box'>
	<ul>
	<li>🔹 A Rule-based Algorithm for stemming.</li>
	<li>🔹 It takes a particular word which have some rule.</li>
	<li>🔹 For a particular rule it'll going on removing suffix till it reaches 5th stage until the inflection is removed.</li>
	<li>🔹 Works only for the English language.</li>
	</ul>
	</div>
	""",
	unsafe_allow_html=True
	)

	st.markdown("<h2 class='sub-header'>❄️ Snowball Stemmer</h2>", unsafe_allow_html=True)
	st.markdown(
	"""
	<div class='info-box'>
	<ul>
	<li>🔹 An advanced version of the Porter Stemmer.</li>
	<li>🔹 Can be applied to multiple languages.</li>
	</ul>
	</div>
	""",
	unsafe_allow_html=True
	)


	st.markdown("<h2 class='sub-header'>🔁 Lancaster Stemmer</h2>", unsafe_allow_html=True)
	st.markdown(
	"""
	<div class='info-box'>
	<ul>
	<li>🔹 An Iterative Algorithm for stemming.</li>
	<li>🔹 Removes suffixes in multiple iterations.</li>
	<li>⚠️ More aggressive removal, which might result in non-English words.</li>
	</ul>
	</div>
	""",
	unsafe_allow_html=True
	)

	st.markdown("<h1 class='header-title'>📖 Lemmatization 🔎</h1>", unsafe_allow_html=True)

	st.markdown(
	"""
	<div class='info-box'>
	<p>📝 <span class='highlight'>Lemmatization</span> is the process of reducing an inflected word to its root form, known as the <span class='highlight'>lemma</span>.</p>
	<ul>
	<li>🔹 <span class='highlight'>Inflected word ➝ Root word (Lemma)</span></li>
	<li>✅ The lemma is always an actual English word.</li>
	<li>🐢 <span class='highlight'>Performance is slower</span> than stemming.</li>
	<li>🔍 Both removal & dictionary-based checking are performed.</li>
	<li>📝 Used when we need to preserve grammar in text.</li>
	</ul>
	</div>
	""",
	unsafe_allow_html=True
	)

	st.markdown("<h2 class='sub-header'>📚 WordNet Lemmatizer</h2>", unsafe_allow_html=True)

	st.markdown(
	"""
	<div class='info-box'>
	<ul>
	<li>🔹 Takes an inflected word as input.</li>
	<li>🗄️ Searches in a huge dictionary (WordNet) containing millions of English words.</li>
	<li>🔄 Iteratively removes suffixes & checks:</li>
	<ul>
	<li>✔️ If it's an actual English word, it continues removing more suffixes.</li>
	<li>❌ If it's not an English word, the last valid root word is returned as the lemma.</li>
	</ul>
	</ul>
	</div>
	""",
	unsafe_allow_html=True
	)

	st.code('''
	from nltk.corpus import stopwords
	from nltk.stem import PorterStemmer,LancasterStemmer,SnowballStemmer,WordNetLemmatizer
	from nltk.tokenize import sent_tokenize,word_tokenize

	def pre_process(data,col,case="lower",tags=True,url=True,mail=True,mentions=True,digits=True,dates=True,emojis=True,contraction=True,stopwordss=True,inflection="stem",stemmer="porter",punc=True):
	stp = stopwords.words("english")
	stp.remove("not")
	ps = PorterStemmer()
	ls = LancasterStemmer()
	sb = SnowballStemmer(language="english")
	wl = WordNetLemmatizer()

	## emoji
	if emojis==True:
	data[col] = data[col].apply(lambda x:emoji.demojize(x,delimiters=('','')))
	else:
	pass

	## case
	if case == "lower":
	data[col]=data[col].str.lower()
	elif case == "upper":
	data[col]=data[col].str.upper()
	else:
	pass

	## tags
	if tags==True:
	data[col] = data[col].apply(lambda x:re.sub("<.*?>"," ",x))
	else:
	pass

	## urls
	if url ==True:
	data[col] = data[col].apply(lambda x:re.sub("https://\S+"," ",x))
	else:
	pass

	## mails
	if mail ==True:
	data[col] = data[col].apply(lambda x:re.sub("\S+@\S+"," ",x))
	else:
	pass

	## mentions
	if mentions ==True:
	data[col] = data[col].apply(lambda x:re.sub("\B[@#]\S+"," ",x))
	else:
	pass

	## digits
	if mentions ==True:
	data[col] = data[col].apply(lambda x:re.sub("\d"," ",x))
	else:
	pass

	## dates
	if dates==True:
	data[col] = data[col].apply(lambda x:re.sub(r"^[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4}$"," ",x))
	data[col] = data[col].apply(lambda x:re.sub(r"^[0-9]{4}\/[0-9]{1,2}\/[0-9]{1,2}$"," ",x))
	else:
	pass

	## contractions
	if contraction==True:
	data[col]= data[col].apply(lambda x:contractions.fix(x))
	else:
	pass

	## punctuations
	if punc == True:
	data[col]=data[col].apply(lambda x:re.sub('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{\|}~]'," ",x))
	else:
	pass

	return data
	''')

	st.markdown('''
	- It'll give the pre-processed text data
	- We'll get the clean processed data on which we can perform feature engineering
	''')