NetuArk Posts Classifier (Ensemble Architecture)

This model is an ensemble classifier that assigns technology-related social media posts to the news source they came from. It is trained to distinguish the following sources:
- ArsTechnica
- FT
- GuardianTech
- HackerNews
- Slashdot
- TechCrunch
- TheVerge

Model Details

Architecture: Voting Classifier (Multinomial Naive Bayes + Logistic Regression)
Vectorization: TF-IDF (N-grams 1-3)
Accuracy: 94.81% on the NetuArk-6000 dataset.
Classes: HackerNews, TechCrunch, TheVerge, FT, GuardianTech, Slashdot, ArsTechnica.
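
The architecture listed above can be sketched with scikit-learn as follows. This is a minimal illustration, not the training script for this model: the soft-voting mode, the `max_iter` value, the default hyperparameters, and the toy posts and labels are all assumptions, since the card does not state them.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_pipeline() -> Pipeline:
    """TF-IDF (1-3 grams) feeding a Naive Bayes + Logistic Regression voter."""
    ensemble = VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),  # max_iter is an assumption
        ],
        voting="soft",  # assumption: the card does not say hard or soft voting
    )
    return Pipeline(
        [
            ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
            ("clf", ensemble),
        ]
    )


# Toy posts and labels, made up purely to show the fit/predict flow.
toy_texts = [
    "Ask HN: how do you manage dotfiles? #hackernews",
    "Show HN: a tiny HTTP server in Rust #hackernews",
    "Startup raises Series A for AI note-taking #techcrunch",
    "Fintech startup lands seed funding round #techcrunch",
]
toy_labels = ["HackerNews", "HackerNews", "TechCrunch", "TechCrunch"]

pipeline = build_pipeline()
pipeline.fit(toy_texts, toy_labels)
toy_prediction = pipeline.predict(["Show HN: my weekend project #hackernews"])
print(toy_prediction)
```

With soft voting, the ensemble averages the class probabilities of both estimators and picks the highest; hard voting would instead take a majority over their predicted labels.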

Training Data

Trained on the Xerv-AI/netuark-posts-6000 dataset.

Usage

import __main__

import joblib
from huggingface_hub import hf_hub_download

# The pickled model references a custom function named advanced_clean,
# so a function with that name must exist before the model is unpickled.
# This stand-in returns the text unchanged.
def advanced_clean(text):
    return text

# Attach it to __main__ so joblib can resolve it during loading
__main__.advanced_clean = advanced_clean

# Repository and filename
repo_id = 'Phase-Technologies/netuark-classifier-ensemble'
filename = 'netuark_ensemble_classifier.joblib'

try:
    # Download the file from Hugging Face
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)

    # Load the model
    model = joblib.load(file_path)
    prediction = model.predict(["📰 Perplexity's 'Personal Computer' Lets AI Agents Access Your Local Files #slashdot"])
    print(f"Prediction: {prediction}")
except Exception as e:
    import traceback
    print(f"An error occurred: {e}")
    traceback.print_exc()
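
To label several posts in one call, a small wrapper along these lines may be convenient. It is a sketch that assumes only that the loaded object exposes a scikit-learn style `predict` method; `label_posts` and `KeywordStub` are hypothetical names introduced here, and the stub merely stands in for the downloaded model so the snippet runs offline.

```python
from typing import Dict, Iterable, List


def label_posts(model, posts: Iterable[str]) -> Dict[str, str]:
    """Map each post to the label returned by model.predict."""
    post_list: List[str] = list(posts)
    labels = model.predict(post_list)
    return {post: str(label) for post, label in zip(post_list, labels)}


class KeywordStub:
    """Hypothetical stand-in for the real model, for demonstration only."""

    def predict(self, posts: List[str]) -> List[str]:
        # Naive rule: look for a hashtag instead of running a classifier.
        return ["Slashdot" if "#slashdot" in p.lower() else "HackerNews" for p in posts]


results = label_posts(
    KeywordStub(),
    ["AI agents can now access local files #slashdot", "Show HN: my side project"],
)
for post, label in results.items():
    print(f"{label}: {post}")
```

With the real model, pass the object returned by `joblib.load` in place of `KeywordStub()`.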

Evaluation results

Accuracy on netuark-posts-6000 (self-reported): 93.750