---
title: Agllm Public
emoji: 🦀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.28.3
python_version: "3.9"
app_file: app.py
pinned: false
license: apache-2.0
---

## PestIDBot - Quick Reference

### Environment

```bash
# Conda environment: agllm-june-15
# Python path: /Users/muhammadarbabarshad/miniconda3/envs/agllm-june-15/bin/python
source ~/miniconda3/etc/profile.d/conda.sh && conda deactivate && conda activate agllm-june-15

# Required env vars (in .env file)
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...  # optional, for Claude models
OPENROUTER_API_KEY=...        # optional, for Llama/Gemini
```

### Key Commands

| Task | Command |
|------|---------|
| Build DB | `python app_database_prep.py` |
| Run Eval | `python retrieval_evaluation.py` |
| Run App | `python app.py` |
| Run Examples | `python app.py --examples` (saves JSON to `examples-work-out/`) |
| Deploy Dev | `git push space3 fresh-start:main` |
| Deploy Prod | `git push space2 fresh-start:main` |

### Example Vignettes → Paper Workflow

1. `python app.py --examples` → runs the 13 vignettes in `run_examples()` (6 single-model + 7 multi-model) and writes `examples-work-out/example_results_{timestamp}.json`.
2. `python examples-work-out/generate_latex.py` → reads the latest JSON, converts each result to a LaTeX fragment (with a GPT-4.1-generated commentary), and writes one `.tex` per example into `writing/65d4fadc59fceb1a54d1aae6/examples/`. `main.tex` includes them via `\input{examples/}`.

To change what runs, edit the `examples` list in `app.py::run_examples()`.

### Git Remotes

- `space2` → `git@hf.co:spaces/arbabarshad/agllm2` (production)
- `space3` → `git@hf.co:spaces/arbabarshad/agllm2-dev` (dev)
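In a fresh clone the remotes have to be re-created once; a one-time setup, assuming SSH access to the Hugging Face Spaces is already configured:

```bash
# Re-create the deploy remotes listed above (one-time, per clone)
git remote add space2 git@hf.co:spaces/arbabarshad/agllm2      # production
git remote add space3 git@hf.co:spaces/arbabarshad/agllm2-dev  # dev
git remote -v  # verify
```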
### Project Structure

```
├── app.py                             # Main Gradio app (deployed)
├── app_database_prep.py               # Builds ChromaDB from PDFs + Excel
├── retrieval_evaluation.py            # Runs 4-filter evaluation
├── retrieval_evaluation_results.json  # Eval metrics output
│
├── agllm-data/
│   ├── agllm-data-isu-field-insects-all-species/
│   │   ├── *.pdf                           # Insect IPM documents
│   │   └── matched_species_results_v2.csv  # Species metadata
│   ├── agllm-data-isu-field-weeds-all-species/
│   │   ├── *.pdf                           # Weed IPM documents
│   │   └── matched_species_results_v2.csv  # Species metadata
│   └── PestID Species.xlsx                 # India & Africa data (sheets)
│
├── vector-databases-deployed/
│   └── db5-agllm-data-isu-field-insects-all-species/  # ChromaDB output
│
├── species-organized/           # Analysis scripts & outputs
│   ├── species_analysis.py      # Generates paper Figure 3
│   └── species_table.tex        # LaTeX species table
│
├── examples-work-out/
│   ├── generate_latex.py        # JSON → standalone LaTeX doc (not used by main.tex)
│   ├── example_results_*.json   # Generated example outputs
│   ├── example_results_20260303_110427.json  # All 4 vignettes (SLF, Striga, FAW, Sahara Mustard) — latest, updated prompt
│   └── species-verification.md  # Vignette tracking: species, regions, DB status, example outputs
│
└── writing/
    └── 65d4fadc59fceb1a54d1aae6/main.tex  # Paper draft
```

### Database Build Flow (4 Geographic Tiers)

| Tier | Species | Chunks | Source |
|------|---------|--------|--------|
| Midwest USA | 82 | 388 | ISU Handbook PDFs |
| USA | 109 | ~500 | GPT-4o generated IPM |
| Africa | 35 | 70 | Expert-curated Excel |
| India | 11 | 39 | Expert-curated Excel |
| **Primary 3 (unique)** | **115** | **497** | 82+35+11 entries; 13 species appear in 2 regions |
| **All 4 tiers** | **~224 entries** | **~997** | 115 primary + 109 USA entries; all 109 USA species also appear in a primary tier, so unique species stay at 115 |

> **Species counts verified 2026-03-13.** The USA tier was previously reported as 219 due to a bug in `generate_usa_ipm_info.py`: the results JSONL accumulated across 3 separate runs, and the dedup was case-sensitive (missing capitalization variants from older Excel versions). Both issues are fixed. True count: 109 unique species confirmed present in the USA, out of 115 primary-tier species; the remaining 6 are not found in the USA (see the paper's Supplementary).

**Midwest USA Data (82 species):**

1. PDFs loaded from `agllm-data/agllm-data-isu-field-{insects,weeds}-all-species/` (content source)
2. `matched_species_results_v2.csv` in each folder maps PDF filename → species name (metadata); region hardcoded as `"Midwest USA"` in `app_database_prep.py`
3. The `Midwest USA` sheet in the Excel is a reference copy only — **not** read by `app_database_prep.py`

**USA Data (109 species, LLM-generated):**

4. Run `generate_usa_ipm_info.py` to query GPT-4o for all species (species from all 3 primary tiers, filtered to those confirmed present in the USA)
5. Creates the `USA` sheet in `species-organized/PestID Species - Organized.xlsx` — both content (IPM Info column) and metadata are read from this sheet

**Africa/India Data (35 + 11 species):**

6. The `Africa` and `India` sheets in the same Excel file provide both content (IPM Info) and metadata

**All Data:**

7. Documents chunked (512 tokens, 10-token overlap)
8. Tagged with `matched_specie_X` + `region` metadata
9. Stored in ChromaDB at `vector-databases-deployed/db5-*/`
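For orientation, a minimal retrieval sketch against the deployed store. The collection name (`ipm_docs`) and the exact metadata key/value conventions are assumptions here — `app_database_prep.py` is the authoritative source for both:

```python
# Minimal query sketch against the deployed ChromaDB store.
# ASSUMPTIONS: the collection name ("ipm_docs") and the metadata values
# ("matched_specie_1", "region") are illustrative; check
# app_database_prep.py for the real names.
import chromadb

client = chromadb.PersistentClient(
    path="vector-databases-deployed/db5-agllm-data-isu-field-insects-all-species"
)
collection = client.get_collection("ipm_docs")  # hypothetical name

# Species + Region filtering (the best-performing filter in the
# evaluation table further below)
results = collection.query(
    query_texts=["How do I manage fall armyworm in maize?"],
    n_results=4,
    where={"$and": [
        {"matched_specie_1": "fall armyworm"},
        {"region": "Africa"},
    ]},
)
print(results["documents"][0])
```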
### Generate USA IPM Info (GPT-4o)

```bash
# Full run (prepare → process → parse)
export OPENAI_API_KEY="your-key-here"
python generate_usa_ipm_info.py --force

# Or run steps individually:
python generate_usa_ipm_info.py --step prepare  # Create JSONL requests
python generate_usa_ipm_info.py --step process  # Call GPT-4o API
python generate_usa_ipm_info.py --step parse    # Create Excel sheet
```

**Output:** Updates `species-organized/PestID Species - Organized.xlsx` with a `USA` sheet containing the 109 species present in the United States (pests + beneficials).

### Evaluation Filters (retrieval_evaluation.py)

| Filter | P@1 | P@2 | P@4 | P@8 | nDCG@1 | nDCG@2 | nDCG@4 | nDCG@8 |
|--------|-----|-----|-----|-----|--------|--------|--------|--------|
| No Filter | 0.50 | 0.72 | 0.79 | 0.87 | 0.50 | 0.64 | 0.67 | 0.70 |
| Species Only | 0.69 | 0.85 | 0.97 | **1.00** | 0.69 | 0.79 | 0.85 | 0.86 |
| Region Only | 0.53 | 0.77 | 0.84 | 0.89 | 0.53 | 0.68 | 0.72 | 0.73 |
| Species + Region | **0.74** | **0.89** | **0.99** | **1.00** | **0.74** | **0.83** | **0.88** | **0.89** |

### Adding a New Species (Full Pipeline)

```bash
# Activate environment (always deactivate first to avoid path conflicts)
source ~/miniconda3/etc/profile.d/conda.sh && conda deactivate && conda activate agllm-june-15

# Set API keys (or ensure .env file is correct — note: shell env vars override .env)
export OPENAI_API_KEY="sk-proj-..."
export OPENROUTER_API_KEY="sk-or-v1-..."

# 1. Copy PDF to the appropriate data directory
cp "Species Name.pdf" agllm-data/agllm-data-isu-field-{insects,weeds}-all-species/raw-pdfs/

# 2. Add entry to matched_species_results_v2.csv (in the same data directory)
echo "Genus species,Common Name,Genus species.pdf" >> agllm-data/agllm-data-isu-field-{insects,weeds}-all-species/matched_species_results_v2.csv

# 3. (Optional) Add to the Midwest USA sheet in the Excel for reference
#    Edit species-organized/PestID Species - Organized.xlsx

# 4. Regenerate USA tier IPM info (includes the new species)
python generate_usa_ipm_info.py --force

# 5. Rebuild ChromaDB
python app_database_prep.py

# 6. Run examples to verify
python app.py --examples

# 7. Regenerate visualization and copy to paper figures
python species-organized/species_analysis.py
cp species-organized/species_analysis.pdf writing/65d4fadc59fceb1a54d1aae6/figures-in-making/species_analysis.pdf

# 8. Run retrieval evaluation
python retrieval_evaluation.py
```

**Important notes:**

- Always `conda deactivate && conda activate agllm-june-15` before running (avoids env path conflicts where the `agllm-env1-updates-1` python gets used instead)
- Shell env vars override the `.env` file — if you see `Incorrect API key` errors, check `echo $OPENAI_API_KEY` (see the sketch after this list)
- `matched_species_results_v2.csv` maps PDF filenames → species names; it is the authoritative source for the Midwest USA tier
- Multi-page PDFs are fully supported (all pages are ingested and chunked)
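The env-var precedence above can be reproduced in isolation. A minimal sketch, assuming the app loads its keys via `python-dotenv` (an assumption, not verified against `app.py`); by default `load_dotenv()` leaves existing shell variables untouched:

```python
# Demonstrates why a stale shell variable masks the .env value,
# assuming python-dotenv is the loader (not verified against app.py).
import os
from dotenv import load_dotenv

os.environ["OPENAI_API_KEY"] = "sk-proj-stale-shell-value"
load_dotenv()  # default override=False: existing shell vars win
print(os.getenv("OPENAI_API_KEY"))  # prints the stale shell value

load_dotenv(override=True)  # would force the .env value instead
```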
---

### Verifying species counts

To reproduce the verified species counts (2026-03-13):

```bash
source ~/miniconda3/etc/profile.d/conda.sh && conda deactivate && conda activate agllm-june-15
python3 -c "
import pandas as pd
xl = 'species-organized/PestID Species - Organized.xlsx'
sheets = {s: pd.read_excel(xl, sheet_name=s) for s in ['Midwest USA','Africa','India','USA']}
for s, df in sheets.items():
    print(f'{s}: {df[\"Species\"].notna().sum()} rows, {df[\"Species\"].dropna().str.strip().str.lower().nunique()} unique')
primary = set()
for s in ['Midwest USA','Africa','India']:
    primary |= set(sheets[s]['Species'].dropna().str.strip().str.lower())
usa = set(sheets['USA']['Species'].dropna().str.strip().str.lower())
print(f'Primary 3 unique: {len(primary)}, USA unique: {len(usa)}')
print(f'In USA: {len(primary & usa)}, Not in USA: {len(primary - usa)}')
print('Not in USA:', sorted(primary - usa))
"
```

**Expected output:**

```
Midwest USA: 82 rows, 82 unique
Africa: 35 rows, 35 unique
India: 11 rows, 11 unique
USA: 109 rows, 109 unique
Primary 3 unique: 115, USA unique: 109
In USA: 109, Not in USA: 6
Not in USA: ['cleome aculeata', 'cleome houtteana', 'cleome monophylla', 'cyperus mindorensis', 'spodoptera litura', 'trianthema triquetrum']
```

> **Bug history:** The USA sheet previously showed 219 rows due to (1) `generate_usa_ipm_info.py` being run 3× without clearing `usa_ipm_results.jsonl`, and (2) case-sensitive dedup missing capitalization variants from older Excel versions. Both fixed in `generate_usa_ipm_info.py`.

---

## Git LFS Troubleshooting Notes

This repository encountered several Git LFS issues during setup. Here's a summary for future reference:

1. **Missing LFS Objects in History:** Initial pushes failed because the branch history contained references to LFS objects (specifically `a11f8941...`, related to `db5/.../data_level0.bin`) that were no longer available locally or in the remote LFS store. Attempts to rewrite history with `git filter-branch` also failed because the rewrite itself required fetching *other* missing LFS objects.
   * **Resolution:** We created a clean base branch with no history (`git checkout --orphan fresh-start`), committed a placeholder file, and force-pushed it to the remote (`git push -u space3 fresh-start:main --force`). This reset the remote `main` branch.
2. **Importing State & Untracked Binaries:** We copied the desired file state from the old branch into the clean `fresh-start` branch (`git checkout <old-branch> -- .`). However, the subsequent push failed because some binary files (e.g., `.png`) were included but weren't tracked by LFS according to the `.gitattributes` file *at that time*.
   * **Resolution:**
     * Added the necessary file patterns (e.g., `*.png filter=lfs ...`) to `.gitattributes`.
     * Crucially, we had to ensure the commit correctly reflected this change; amending wasn't sufficient. We used:

       ```bash
       # Reset the commit but keep files in the working dir
       git reset HEAD~1
       # Re-stage files, forcing re-evaluation against the current .gitattributes
       git add --renormalize .
       # Commit the properly processed files
       git commit -m "Commit message"
       # Force-push the corrected commit
       git push --force
       ```
3. **Ignoring Necessary Directories:** A required directory (`vector-databases-deployed`) was unintentionally ignored via `.gitignore`.
   * **Resolution:**
     * Removed the corresponding line from `.gitignore`.
     * Staged the `.gitignore` file and the previously ignored directory (`git add .gitignore vector-databases-deployed/`).
     * Committed and pushed the changes.

**Key Takeaways:**

* Pushing branches with problematic LFS history to a fresh remote can fail. Starting the remote with a clean, history-free branch is a workaround.
* When adding LFS tracking for existing binary files via `.gitattributes`, ensure the commit actually converts the files to LFS pointers. Running `git add --renormalize .` after updating `.gitattributes` and *before* committing is often necessary (see the check below).
* Double-check `.gitignore` if expected files or directories are missing after `git add .`.
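To confirm the renormalize step actually converted binaries to pointers before force-pushing, assuming `git-lfs` is installed (the `.png` path below is illustrative, not a real file in this repo):

```bash
# List files whose committed blobs are LFS pointers (should include the binaries)
git lfs ls-files

# Inspect a committed blob directly; a converted file shows the pointer
# header instead of raw binary data:
git show HEAD:figures/example.png | head -n 3
# version https://git-lfs.github.com/spec/v1
# oid sha256:...
# size 12345
```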