---
title: Agllm Public
emoji: 🦀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.28.3
python_version: "3.9"
app_file: app.py
pinned: false
license: apache-2.0
---

## PestIDBot - Quick Reference

### Environment

```bash
# Conda environment: agllm-june-15
# Python path: /Users/muhammadarbabarshad/miniconda3/envs/agllm-june-15/bin/python
source ~/miniconda3/etc/profile.d/conda.sh && conda deactivate && conda activate agllm-june-15

# Required env vars (in .env file)
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...  # optional, for Claude models
OPENROUTER_API_KEY=...        # optional, for Llama/Gemini
```

### Key Commands

| Task | Command |
|------|---------|
| Build DB | `python app_database_prep.py` |
| Run Eval | `python retrieval_evaluation.py` |
| Run App | `python app.py` |
| Run Examples | `python app.py --examples` (saves JSON to `examples-work-out/`) |
| Deploy Dev | `git push space3 fresh-start:main` |
| Deploy Prod | `git push space2 fresh-start:main` |

### Example Vignettes → Paper Workflow

1. `python app.py --examples` → runs the 13 vignettes in `run_examples()` (6 single-model + 7 multi-model) and writes `examples-work-out/example_results_{timestamp}.json`.
2. `python examples-work-out/generate_latex.py` → reads the latest JSON, converts each result to a LaTeX fragment (with a GPT-4.1-generated commentary), and writes one `.tex` per example into `writing/65d4fadc59fceb1a54d1aae6/examples/`. `main.tex` includes them via `\input{examples/}`.

To change what runs, edit the `examples` list in `app.py::run_examples()`.

### Git Remotes

- `space2` → `git@hf.co:spaces/arbabarshad/agllm2` (production)
- `space3` → `git@hf.co:spaces/arbabarshad/agllm2-dev` (dev)
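In a fresh clone the remotes have to be re-created once; a one-time setup, assuming SSH access to the Hugging Face Spaces is already configured:

```bash
# Re-create the deploy remotes listed above (one-time, per clone)
git remote add space2 git@hf.co:spaces/arbabarshad/agllm2      # production
git remote add space3 git@hf.co:spaces/arbabarshad/agllm2-dev  # dev
git remote -v  # verify
```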
### Project Structure

```
├── app.py                             # Main Gradio app (deployed)
├── app_database_prep.py               # Builds ChromaDB from PDFs + Excel
├── retrieval_evaluation.py            # Runs 4-filter evaluation
├── retrieval_evaluation_results.json  # Eval metrics output
│
├── agllm-data/
│   ├── agllm-data-isu-field-insects-all-species/
│   │   ├── *.pdf                           # Insect IPM documents
│   │   └── matched_species_results_v2.csv  # Species metadata
│   ├── agllm-data-isu-field-weeds-all-species/
│   │   ├── *.pdf                           # Weed IPM documents
│   │   └── matched_species_results_v2.csv  # Species metadata
│   └── PestID Species.xlsx                 # India & Africa data (sheets)
│
├── vector-databases-deployed/
│   └── db5-agllm-data-isu-field-insects-all-species/  # ChromaDB output
│
├── species-organized/           # Analysis scripts & outputs
│   ├── species_analysis.py      # Generates paper Figure 3
│   └── species_table.tex        # LaTeX species table
│
├── examples-work-out/
│   ├── generate_latex.py        # JSON → standalone LaTeX doc (not used by main.tex)
│   ├── example_results_*.json   # Generated example outputs
│   ├── example_results_20260303_110427.json  # All 4 vignettes (SLF, Striga, FAW, Sahara Mustard) — latest, updated prompt
│   └── species-verification.md  # Vignette tracking: species, regions, DB status, example outputs
│
└── writing/
    └── 65d4fadc59fceb1a54d1aae6/main.tex  # Paper draft
```

### Database Build Flow (4 Geographic Tiers)

| Tier | Species | Chunks | Source |
|------|---------|--------|--------|
| Midwest USA | 82 | 388 | ISU Handbook PDFs |
| USA | 109 | ~500 | GPT-4o generated IPM |
| Africa | 35 | 70 | Expert-curated Excel |
| India | 11 | 39 | Expert-curated Excel |
| **Primary 3 (unique)** | **115** | **497** | 82+35+11 entries; 13 species appear in 2 regions |
| **All 4 tiers** | **~224 entries** | **~997** | 115 primary + 109 USA entries; all 109 USA species also appear in a primary tier, so unique species stay at 115 |

> **Species counts verified 2026-03-13.** The USA tier was previously reported as 219 due to a bug in `generate_usa_ipm_info.py`: the results JSONL accumulated across 3 separate runs, and the dedup was case-sensitive (missing capitalization variants from older Excel versions). Both issues are fixed. True count: 109 unique species confirmed present in the USA, out of 115 primary-tier species; the remaining 6 are not found in the USA (see the paper's Supplementary).

**Midwest USA Data (82 species):**

1. PDFs loaded from `agllm-data/agllm-data-isu-field-{insects,weeds}-all-species/` (content source)
2. `matched_species_results_v2.csv` in each folder maps PDF filename → species name (metadata); region hardcoded as `"Midwest USA"` in `app_database_prep.py`
3. The `Midwest USA` sheet in the Excel is a reference copy only — **not** read by `app_database_prep.py`

**USA Data (109 species, LLM-generated):**

4. Run `generate_usa_ipm_info.py` to query GPT-4o for all species (species from all 3 primary tiers, filtered to those confirmed present in the USA)
5. Creates the `USA` sheet in `species-organized/PestID Species - Organized.xlsx` — both content (IPM Info column) and metadata are read from this sheet

**Africa/India Data (35 + 11 species):**

6. The `Africa` and `India` sheets in the same Excel file provide both content (IPM Info) and metadata

**All Data:**

7. Documents chunked (512 tokens, 10-token overlap)
8. Tagged with `matched_specie_X` + `region` metadata
9. Stored in ChromaDB at `vector-databases-deployed/db5-*/`
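For orientation, a minimal retrieval sketch against the deployed store. The collection name (`ipm_docs`) and the exact metadata key/value conventions are assumptions here — `app_database_prep.py` is the authoritative source for both:

```python
# Minimal query sketch against the deployed ChromaDB store.
# ASSUMPTIONS: the collection name ("ipm_docs") and the metadata values
# ("matched_specie_1", "region") are illustrative; check
# app_database_prep.py for the real names.
import chromadb

client = chromadb.PersistentClient(
    path="vector-databases-deployed/db5-agllm-data-isu-field-insects-all-species"
)
collection = client.get_collection("ipm_docs")  # hypothetical name

# Species + Region filtering (the best-performing filter in the
# evaluation table further below)
results = collection.query(
    query_texts=["How do I manage fall armyworm in maize?"],
    n_results=4,
    where={"$and": [
        {"matched_specie_1": "fall armyworm"},
        {"region": "Africa"},
    ]},
)
print(results["documents"][0])
```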
### Generate USA IPM Info (GPT-4o)

```bash
# Full run (prepare → process → parse)
export OPENAI_API_KEY="your-key-here"
python generate_usa_ipm_info.py --force

# Or run steps individually:
python generate_usa_ipm_info.py --step prepare  # Create JSONL requests
python generate_usa_ipm_info.py --step process  # Call GPT-4o API
python generate_usa_ipm_info.py --step parse    # Create Excel sheet
```

**Output:** Updates `species-organized/PestID Species - Organized.xlsx` with a `USA` sheet containing the 109 species present in the United States (pests + beneficials).

### Evaluation Filters (retrieval_evaluation.py)

| Filter | P@1 | P@2 | P@4 | P@8 | nDCG@1 | nDCG@2 | nDCG@4 | nDCG@8 |
|--------|-----|-----|-----|-----|--------|--------|--------|--------|
| No Filter | 0.50 | 0.72 | 0.79 | 0.87 | 0.50 | 0.64 | 0.67 | 0.70 |
| Species Only | 0.69 | 0.85 | 0.97 | **1.00** | 0.69 | 0.79 | 0.85 | 0.86 |
| Region Only | 0.53 | 0.77 | 0.84 | 0.89 | 0.53 | 0.68 | 0.72 | 0.73 |
| Species + Region | **0.74** | **0.89** | **0.99** | **1.00** | **0.74** | **0.83** | **0.88** | **0.89** |

### Adding a New Species (Full Pipeline)

```bash
# Activate environment (always deactivate first to avoid path conflicts)
source ~/miniconda3/etc/profile.d/conda.sh && conda deactivate && conda activate agllm-june-15

# Set API keys (or ensure .env file is correct — note: shell env vars override .env)
export OPENAI_API_KEY="sk-proj-..."
export OPENROUTER_API_KEY="sk-or-v1-..."

# 1. Copy PDF to the appropriate data directory
cp "Species Name.pdf" agllm-data/agllm-data-isu-field-{insects,weeds}-all-species/raw-pdfs/

# 2. Add entry to matched_species_results_v2.csv (in the same data directory)
echo "Genus species,Common Name,Genus species.pdf" >> agllm-data/agllm-data-isu-field-{insects,weeds}-all-species/matched_species_results_v2.csv

# 3. (Optional) Add to the Midwest USA sheet in the Excel for reference
#    Edit species-organized/PestID Species - Organized.xlsx

# 4. Regenerate USA tier IPM info (includes the new species)
python generate_usa_ipm_info.py --force

# 5. Rebuild ChromaDB
python app_database_prep.py

# 6. Run examples to verify
python app.py --examples

# 7. Regenerate visualization and copy to paper figures
python species-organized/species_analysis.py
cp species-organized/species_analysis.pdf writing/65d4fadc59fceb1a54d1aae6/figures-in-making/species_analysis.pdf

# 8. Run retrieval evaluation
python retrieval_evaluation.py
```

**Important notes:**

- Always `conda deactivate && conda activate agllm-june-15` before running (avoids env path conflicts where the `agllm-env1-updates-1` python gets used instead)
- Shell env vars override the `.env` file — if you see `Incorrect API key` errors, check `echo $OPENAI_API_KEY` (see the sketch after this list)
- `matched_species_results_v2.csv` maps PDF filenames → species names; it is the authoritative source for the Midwest USA tier
- Multi-page PDFs are fully supported (all pages are ingested and chunked)
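The env-var precedence above can be reproduced in isolation. A minimal sketch, assuming the app loads its keys via `python-dotenv` (an assumption, not verified against `app.py`); by default `load_dotenv()` leaves existing shell variables untouched:

```python
# Demonstrates why a stale shell variable masks the .env value,
# assuming python-dotenv is the loader (not verified against app.py).
import os
from dotenv import load_dotenv

os.environ["OPENAI_API_KEY"] = "sk-proj-stale-shell-value"
load_dotenv()  # default override=False: existing shell vars win
print(os.getenv("OPENAI_API_KEY"))  # prints the stale shell value

load_dotenv(override=True)  # would force the .env value instead
```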
---

### Verifying species counts

To reproduce the verified species counts (2026-03-13):

```bash
source ~/miniconda3/etc/profile.d/conda.sh && conda deactivate && conda activate agllm-june-15
python3 -c "
import pandas as pd
xl = 'species-organized/PestID Species - Organized.xlsx'
sheets = {s: pd.read_excel(xl, sheet_name=s) for s in ['Midwest USA','Africa','India','USA']}
for s, df in sheets.items():
    print(f'{s}: {df[\"Species\"].notna().sum()} rows, {df[\"Species\"].dropna().str.strip().str.lower().nunique()} unique')
primary = set()
for s in ['Midwest USA','Africa','India']:
    primary |= set(sheets[s]['Species'].dropna().str.strip().str.lower())
usa = set(sheets['USA']['Species'].dropna().str.strip().str.lower())
print(f'Primary 3 unique: {len(primary)}, USA unique: {len(usa)}')
print(f'In USA: {len(primary & usa)}, Not in USA: {len(primary - usa)}')
print('Not in USA:', sorted(primary - usa))
"
```

**Expected output:**

```
Midwest USA: 82 rows, 82 unique
Africa: 35 rows, 35 unique
India: 11 rows, 11 unique
USA: 109 rows, 109 unique
Primary 3 unique: 115, USA unique: 109
In USA: 109, Not in USA: 6
Not in USA: ['cleome aculeata', 'cleome houtteana', 'cleome monophylla', 'cyperus mindorensis', 'spodoptera litura', 'trianthema triquetrum']
```

> **Bug history:** The USA sheet previously showed 219 rows due to (1) `generate_usa_ipm_info.py` being run 3× without clearing `usa_ipm_results.jsonl`, and (2) case-sensitive dedup missing capitalization variants from older Excel versions. Both fixed in `generate_usa_ipm_info.py`.

---

## Git LFS Troubleshooting Notes

This repository encountered several Git LFS issues during setup. Here's a summary for future reference:

1. **Missing LFS Objects in History:** Initial pushes failed because the branch history contained references to LFS objects (specifically `a11f8941...`, related to `db5/.../data_level0.bin`) that were no longer available locally or in the remote LFS store. Attempts to rewrite history with `git filter-branch` also failed because the rewrite itself required fetching *other* missing LFS objects.
   * **Resolution:** We created a clean base branch with no history (`git checkout --orphan fresh-start`), committed a placeholder file, and force-pushed it to the remote (`git push -u space3 fresh-start:main --force`). This reset the remote `main` branch.
2. **Importing State & Untracked Binaries:** We copied the desired file state from the old branch into the clean `fresh-start` branch (`git checkout <old-branch> -- .`). However, the subsequent push failed because some binary files (e.g., `.png`) were included but weren't tracked by LFS according to the `.gitattributes` file *at that time*.
   * **Resolution:**
     * Added the necessary file patterns (e.g., `*.png filter=lfs ...`) to `.gitattributes`.
     * Crucially, we had to ensure the commit correctly reflected this change; amending wasn't sufficient. We used:

       ```bash
       # Reset the commit but keep files in the working dir
       git reset HEAD~1
       # Re-stage files, forcing re-evaluation against the current .gitattributes
       git add --renormalize .
       # Commit the properly processed files
       git commit -m "Commit message"
       # Force-push the corrected commit
       git push --force
       ```
3. **Ignoring Necessary Directories:** A required directory (`vector-databases-deployed`) was unintentionally ignored via `.gitignore`.
   * **Resolution:**
     * Removed the corresponding line from `.gitignore`.
     * Staged the `.gitignore` file and the previously ignored directory (`git add .gitignore vector-databases-deployed/`).
     * Committed and pushed the changes.

**Key Takeaways:**

* Pushing branches with problematic LFS history to a fresh remote can fail. Starting the remote with a clean, history-free branch is a workaround.
* When adding LFS tracking for existing binary files via `.gitattributes`, ensure the commit actually converts the files to LFS pointers. Running `git add --renormalize .` after updating `.gitattributes` and *before* committing is often necessary (see the check below).
* Double-check `.gitignore` if expected files or directories are missing after `git add .`.
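To confirm the renormalize step actually converted binaries to pointers before force-pushing, assuming `git-lfs` is installed (the `.png` path below is illustrative, not a real file in this repo):

```bash
# List files whose committed blobs are LFS pointers (should include the binaries)
git lfs ls-files

# Inspect a committed blob directly; a converted file shows the pointer
# header instead of raw binary data:
git show HEAD:figures/example.png | head -n 3
# version https://git-lfs.github.com/spec/v1
# oid sha256:...
# size 12345
```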