Keyless search: DuckDuckGo + direct HTTP browse; optional Serper/Jina
- .gitattributes +2 -0
- README.md +15 -2
- app.py +16 -2
- dotenv.example +3 -1
- requirements.txt +1 -0
- scripts/entrypoint.sh +10 -0
- scripts/verify_env.py +3 -2
- space_gen.py +16 -1
- space_health.py +95 -0
- tests/__pycache__/test_imports.cpython-313-pytest-9.0.2.pyc +0 -0
- vendor/rllm/vision_deepresearch_async_workflow/tools/gen_jina_browse_impl.py +79 -0
- vendor/rllm/vision_deepresearch_async_workflow/tools/gen_universal_image_search_impl.py +61 -3
- vendor/rllm/vision_deepresearch_async_workflow/tools/gen_web_tools.py +106 -26
.gitattributes
CHANGED

```diff
@@ -1,2 +1,4 @@
 # Linux containers require LF in shell scripts (CRLF causes: env: 'bash\r': No such file)
 *.sh text eol=lf
+vendor/rllm/docs/assets/rllm_components.png filter=lfs diff=lfs merge=lfs -text
+vendor/rllm/docs/assets/sdk_arch.png filter=lfs diff=lfs merge=lfs -text
```
README.md
CHANGED

````diff
@@ -25,8 +25,11 @@ Configure in the Space **Settings → Variables and secrets** (or a mounted `.en
 
 | Variable | Purpose |
 |----------|---------|
-| `SERPER_KEY_ID` | Serper API key
-| `JINA_API_KEYS` | Jina reader
+| `SERPER_KEY_ID` | Optional. Serper API key for Google web + image search. If **unset**, text and image search use **DuckDuckGo** (no key; quality and availability vary, and some datacenter IPs may be rate-limited). |
+| `JINA_API_KEYS` | Optional. Jina reader for `r.jina.ai`. If **unset**, the visit tool uses a **direct HTTP GET** and strips HTML to text (many sites block bots or return paywalls). |
+| `WEB_TEXT_SEARCH_PROVIDER` | Override text search: `duckduckgo` or `serper` (default is **auto**: Serper when `SERPER_KEY_ID` is set, else DuckDuckGo). |
+| `WEB_IMAGE_SEARCH_PROVIDER` | Override image search: `duckduckgo` or `serper` (same auto rule using `SERPER_KEY_ID`). |
+| `BROWSE_READ_ENGINE` | Override page fetch: `direct` or `jina` (default is **auto**: Jina when `JINA_API_KEYS` is set, else direct). |
 | `OPENAI_BASE_URL` | OpenAI-compatible base URL for GenSearcher-8B (e.g. `https://.../v1`) |
 | `OPENAI_API_KEY` | API key for that endpoint (use `EMPTY` if unused) |
 | `GEN_EVAL_MODEL` | Served model name (default `Gen-Searcher-8B`) |
@@ -44,6 +47,16 @@ See [`dotenv.example`](./dotenv.example) for a full template.
 - **Minimum practical:** 1× GPU for FireRed + Gradio, with **external** vLLM endpoints for GenSearcher and browse (set `START_VLLM_GENSEARCHER=0`, `START_VLLM_BROWSE=0` — defaults).
 - **Full local (as in upstream scripts):** multiple GPUs — enable `START_VLLM_GENSEARCHER=1`, `START_VLLM_BROWSE=1`, and set `GENSEARCHER_CUDA_VISIBLE_DEVICES`, `BROWSE_CUDA_VISIBLE_DEVICES`, `FIRERED_CUDA_VISIBLE_DEVICES` to disjoint GPU indices.
 
+## Troubleshooting: `Connection error` / model call failed
+
+The agent talks to your LLM over HTTP (OpenAI-compatible). A **connection error** almost always means **nothing is listening** at `OPENAI_BASE_URL`, or the URL is wrong for where the Space runs.
+
+1. **Default `http://127.0.0.1:8002/v1`** only works if **vLLM for GenSearcher-8B** is started **inside the same container** (`START_VLLM_GENSEARCHER=1` and enough GPU). On a typical 1×GPU Space with only FireRed running, **127.0.0.1:8002 is empty** → connection error.
+
+2. **Fix:** Set Space secret **`OPENAI_BASE_URL`** to a **reachable** HTTPS (or HTTP) base URL that ends with **`/v1`**, where you host [GenSearcher/Gen-Searcher-8B](https://huggingface.co/GenSearcher/Gen-Searcher-8B) behind vLLM, Text Generation Inference, or any OpenAI-compatible stack. The UI shows an **endpoint check** on load; use **Re-check endpoints** after you change secrets.
+
+3. **Browse tool:** If `BROWSE_GENERATE_ENGINE=vllm`, set **`BROWSE_SUMMARY_BASE_URL`** the same way (not localhost unless you run that vLLM in-container with `START_VLLM_BROWSE=1`).
+
 ## Local build
 
 ```bash
````
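The "must end with `/v1`" rule above is the one detail most setups get wrong. A minimal sketch of that normalization, with a hypothetical helper name (`normalize_base_url`), mirroring what the Space applies to `OPENAI_BASE_URL`:

```python
def normalize_base_url(url: str) -> str:
    """Trim whitespace and trailing slashes, then append /v1 if missing (sketch)."""
    u = url.strip().rstrip("/")
    return u if u.endswith("/v1") else u + "/v1"


print(normalize_base_url("https://my-vllm.example.com/"))    # https://my-vllm.example.com/v1
print(normalize_base_url("http://127.0.0.1:8002/v1"))        # http://127.0.0.1:8002/v1
```

Either way, the resulting base URL is what the OpenAI-compatible client prepends to paths like `/chat/completions` and `/models`.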
app.py
CHANGED

```diff
@@ -14,6 +14,7 @@ import gradio as gr
 from PIL import Image
 
 from space_gen import run_sync
+from space_health import llm_endpoint_status
 
 
 def _trajectory_to_markdown(trajectory: list) -> str:
@@ -74,9 +75,22 @@ with gr.Blocks(title="GenSearcher + FireRed") as demo:
         "## GenSearcher + FireRed-Image-Edit-1.1\n"
         "Runs the **official** GenSearcher search/browse/image-search agent (vLLM), "
         "then generates with **FireRed** via the same `/generate` API as the Qwen edit server.\n\n"
-        "**
-        "
+        "**LLM (required):** a reachable **OpenAI-compatible** URL in `OPENAI_BASE_URL` (must include `/v1`) for "
+        "[Gen-Searcher-8B](https://huggingface.co/GenSearcher/Gen-Searcher-8B), plus `BROWSE_SUMMARY_BASE_URL` when "
+        "using browse summarization with `BROWSE_GENERATE_ENGINE=vllm` (see README).\n\n"
+        "**Search / browse (optional keys):** without `SERPER_KEY_ID` and `JINA_API_KEYS`, the agent uses **DuckDuckGo** "
+        "for web and image search and **direct HTTP** page fetch for visits. Set those secrets if you prefer Serper + Jina.\n\n"
+        "**Connection errors:** On Hugging Face Spaces, `http://127.0.0.1:8002/v1` only works if you run vLLM "
+        "in the same container (`START_VLLM_GENSEARCHER=1` + GPU). Otherwise set `OPENAI_BASE_URL` to your **public** inference server."
     )
+    status_md = gr.Markdown(llm_endpoint_status())
+    refresh_status = gr.Button("Re-check endpoints", size="sm")
+
+    def _refresh():
+        return llm_endpoint_status()
+
+    refresh_status.click(fn=_refresh, outputs=status_md)
+    demo.load(fn=_refresh, outputs=status_md)
     with gr.Row():
         prompt = gr.Textbox(
             label="Image task / prompt",
```
dotenv.example
CHANGED

```diff
@@ -9,7 +9,9 @@ export GEN_EVAL_MODEL="Gen-Searcher-8B"
 export QWEN_EDIT_APP_URL="http://127.0.0.1:8765"
 export QWEN_EDIT_APP_PATH="/generate"
 
-# Serper + Jina (
+# Optional: Serper + Jina (Google-quality search / reader proxy). If unset, tools use
+# DuckDuckGo for text+image search and plain HTTP fetch for browse (no API keys).
+# Force backends: WEB_TEXT_SEARCH_PROVIDER=duckduckgo|serper, WEB_IMAGE_SEARCH_PROVIDER=..., BROWSE_READ_ENGINE=direct|jina
 export SERPER_KEY_ID=""
 export JINA_API_KEYS=""
 export TEXT_SEARCH_API_BASE_URL="https://google.serper.dev/search"
```
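The "auto" rule those comments describe (explicit override wins, otherwise Serper only when a key is present) can be sketched standalone. The function name is hypothetical and it takes an env mapping as a parameter so the sketch is testable without touching `os.environ`:

```python
def resolve_text_search_provider(env: dict) -> str:
    """Sketch of the documented auto rule for WEB_TEXT_SEARCH_PROVIDER / SERPER_KEY_ID."""
    explicit = (env.get("WEB_TEXT_SEARCH_PROVIDER") or "").strip().lower()
    if explicit in ("duckduckgo", "serper"):
        return explicit  # explicit override always wins
    # No override: use Serper only if a key is actually configured.
    return "serper" if (env.get("SERPER_KEY_ID") or "").strip() else "duckduckgo"


print(resolve_text_search_provider({}))                          # duckduckgo
print(resolve_text_search_provider({"SERPER_KEY_ID": "abc"}))    # serper
print(resolve_text_search_provider(
    {"SERPER_KEY_ID": "abc", "WEB_TEXT_SEARCH_PROVIDER": "duckduckgo"}))  # duckduckgo
```

The same shape applies to `WEB_IMAGE_SEARCH_PROVIDER` and (with `JINA_API_KEYS`) to `BROWSE_READ_ENGINE`.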
requirements.txt
CHANGED

```diff
@@ -4,3 +4,4 @@ accelerate>=0.26.0
 gradio>=4.44.0
 tiktoken>=0.7.0
 uvicorn[standard]>=0.30.0
+duckduckgo-search>=6.0.0
```
scripts/entrypoint.sh
CHANGED

```diff
@@ -13,6 +13,16 @@ if [[ -f /app/.env.gen_image ]]; then
   set +a
 fi
 
+if [[ "${START_VLLM_GENSEARCHER:-0}" != "1" ]]; then
+  case "${OPENAI_BASE_URL:-}" in
+    *127.0.0.1*|*localhost*)
+      echo "[entrypoint] WARNING: OPENAI_BASE_URL points to loopback but START_VLLM_GENSEARCHER is not 1."
+      echo "[entrypoint] The GenSearcher agent will get 'Connection error' unless a server listens here,"
+      echo "[entrypoint] or you set OPENAI_BASE_URL to an external OpenAI-compatible URL (ending in /v1)."
+      ;;
+  esac
+fi
+
 wait_http() {
   local url=$1
   local name=$2
```
scripts/verify_env.py
CHANGED

```diff
@@ -5,8 +5,9 @@ from __future__ import annotations
 import os
 
 CHECKS = [
-
-    ("
+    # Serper / Jina optional: without them the agent uses DuckDuckGo + direct HTTP fetch.
+    ("SERPER_KEY_ID", False),
+    ("JINA_API_KEYS", False),
     ("OPENAI_BASE_URL", True),
     ("GEN_EVAL_MODEL", False),
     ("OPENAI_API_KEY", False),
```
space_gen.py
CHANGED

```diff
@@ -12,6 +12,8 @@ from typing import Any, Dict, List, Optional, Tuple
 
 import requests
 
+from space_health import check_v1_models, is_localhost_url
+
 from rllm.engine.agent_workflow_engine import AgentWorkflowEngine
 from rllm.engine.rollout import OpenAIEngine
 from vision_deepresearch_async_workflow.gen_image_deepresearch_tools_executor import (
@@ -121,11 +123,24 @@ async def run_gensearcher_then_generate(
     }
 
     model = os.environ.get("GEN_EVAL_MODEL", "Gen-Searcher-8B")
-    base_url = os.environ.get("OPENAI_BASE_URL", "http://127.0.0.1:8002/v1").rstrip("/")
+    base_url = os.environ.get("OPENAI_BASE_URL", "http://127.0.0.1:8002/v1").strip().rstrip("/")
     if not base_url.endswith("/v1"):
         base_url = base_url + "/v1"
     api_key = os.environ.get("OPENAI_API_KEY", "EMPTY")
 
+    ok_llm, llm_msg = check_v1_models(base_url, api_key)
+    if not ok_llm:
+        hint = ""
+        if is_localhost_url(base_url):
+            hint = (
+                " You are targeting localhost inside the Space container. Nothing is listening unless you set "
+                "Space variable START_VLLM_GENSEARCHER=1 (and GPU) or change OPENAI_BASE_URL to a reachable "
+                "OpenAI-compatible server (your vLLM / TGI URL ending in /v1)."
+            )
+        raise RuntimeError(
+            f"GenSearcher LLM is not reachable at {base_url}/models — {llm_msg}.{hint}"
+        )
+
     rollout_engine = OpenAIEngine(
         model=model,
         base_url=base_url,
```
space_health.py
ADDED

```python
"""Preflight checks for OpenAI-compatible LLM endpoints (GenSearcher + browse)."""
from __future__ import annotations

import os
from typing import Tuple

import requests


def normalize_openai_v1_base(url: str) -> str:
    u = (url or "").strip().rstrip("/")
    if not u:
        return ""
    if not u.endswith("/v1"):
        u = u + "/v1"
    return u


def check_v1_models(base_url_v1: str, api_key: str, timeout: float = 15.0) -> Tuple[bool, str]:
    """
    GET {base}/models — standard OpenAI-compatible discovery (vLLM, etc.).
    """
    if not base_url_v1:
        return False, "URL is empty"
    url = base_url_v1.rstrip("/") + "/models"
    headers = {"Authorization": f"Bearer {api_key or 'EMPTY'}"}
    try:
        r = requests.get(url, headers=headers, timeout=timeout)
        if r.status_code == 200:
            return True, "OK"
        return False, f"HTTP {r.status_code}: {r.text[:300]}"
    except requests.exceptions.ConnectionError as e:
        return False, f"Connection failed (nothing listening or blocked): {e}"
    except requests.exceptions.Timeout:
        return False, "Timeout — server not responding"
    except requests.exceptions.RequestException as e:
        return False, str(e)


def is_localhost_url(url: str) -> bool:
    u = (url or "").lower()
    return "127.0.0.1" in u or "localhost" in u


def llm_endpoint_status() -> str:
    """Human-readable markdown for Gradio banner."""
    gen_base = normalize_openai_v1_base(os.environ.get("OPENAI_BASE_URL", ""))
    gen_key = os.environ.get("OPENAI_API_KEY", "EMPTY")
    browse_base = normalize_openai_v1_base(os.environ.get("BROWSE_SUMMARY_BASE_URL", ""))
    browse_key = os.environ.get("BROWSE_SUMMARY_API_KEY", os.environ.get("OPENAI_API_KEY", "EMPTY"))

    lines = ["### Endpoint checks", ""]

    if not gen_base:
        lines.append(
            "**GenSearcher LLM:** `OPENAI_BASE_URL` is **not set**. "
            "Add a Space secret pointing to an OpenAI-compatible server that serves **GenSearcher/Gen-Searcher-8B** "
            "(e.g. your own vLLM URL ending in `/v1`)."
        )
    else:
        ok, msg = check_v1_models(gen_base, gen_key)
        if ok:
            lines.append(f"**GenSearcher LLM** (`OPENAI_BASE_URL`): reachable — `{gen_base}`")
        else:
            lines.append(
                f"**GenSearcher LLM** (`OPENAI_BASE_URL`): **unreachable** — `{gen_base}`\n\n"
                f"- Detail: `{msg}`\n"
            )
            if is_localhost_url(gen_base):
                lines.append(
                    "- You are using **localhost / 127.0.0.1**. Inside a Hugging Face Space, that is **this container only**. "
                    "Either set `START_VLLM_GENSEARCHER=1` (and enough GPU) to run vLLM here, "
                    "or set `OPENAI_BASE_URL` to a **public** inference URL (your vLLM, TGI, etc.).\n"
                )

    lines.append("")
    if os.environ.get("BROWSE_GENERATE_ENGINE", "").strip().lower() == "vllm":
        if not browse_base:
            lines.append(
                "**Browse summarizer:** `BROWSE_SUMMARY_BASE_URL` is **not set** (needed when `BROWSE_GENERATE_ENGINE=vllm`)."
            )
        else:
            ok_b, msg_b = check_v1_models(browse_base, browse_key)
            if ok_b:
                lines.append(f"**Browse LLM:** OK — `{browse_base}`")
            else:
                lines.append(
                    f"**Browse LLM:** **unreachable** — `{browse_base}` — `{msg_b}`"
                )
                if is_localhost_url(browse_base):
                    lines.append(
                        "- Same **localhost** note: use an external Qwen3-VL server or `START_VLLM_BROWSE=1` with extra GPU.\n"
                    )

    return "\n".join(lines)
```
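The same `GET {base}/models` preflight idea can be demonstrated end to end with only the standard library (the file above uses `requests`; `urllib` is substituted here so the sketch runs anywhere). A tiny local HTTP server stands in for vLLM; all names here are hypothetical:

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_models(base_url_v1: str, timeout: float = 5.0):
    """Probe an OpenAI-compatible /v1/models endpoint; return (ok, detail)."""
    url = base_url_v1.rstrip("/") + "/models"
    req = urllib.request.Request(url, headers={"Authorization": "Bearer EMPTY"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.URLError as e:
        return False, f"Connection failed: {e.reason}"


class _StubVLLM(BaseHTTPRequestHandler):
    """Answers /v1/models like an OpenAI-compatible server would."""

    def do_GET(self):
        body = json.dumps({"object": "list", "data": []}).encode()
        self.send_response(200 if self.path == "/v1/models" else 404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass


server = HTTPServer(("127.0.0.1", 0), _StubVLLM)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}/v1"
ok, msg = check_models(base)
print(ok, msg)  # True HTTP 200
server.shutdown()
```

Against a loopback URL with nothing listening, the same probe returns `(False, "Connection failed: …")`, which is exactly the situation the README's troubleshooting section describes.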
tests/__pycache__/test_imports.cpython-313-pytest-9.0.2.pyc
ADDED

Binary file (3.24 kB).
vendor/rllm/vision_deepresearch_async_workflow/tools/gen_jina_browse_impl.py
CHANGED

```diff
@@ -11,6 +11,7 @@ for clean open-source distribution.
 
 import os
 import random
+import re
 import time
 from typing import Optional
 
@@ -31,6 +32,84 @@ def _get_jina_proxies() -> Optional[dict]:
     return None
 
 
+_BROWSE_UA = os.environ.get(
+    "BROWSE_DIRECT_UA",
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
+    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
+)
+
+
+def _html_to_text(html: str, max_chars: int) -> str:
+    """Cheap HTML→text for keyless browse (no extra dependencies)."""
+    html = re.sub(r"(?is)<script[^>]*>.*?</script>", " ", html)
+    html = re.sub(r"(?is)<style[^>]*>.*?</style>", " ", html)
+    html = re.sub(r"(?is)<noscript[^>]*>.*?</noscript>", " ", html)
+    html = re.sub(r"(?s)<[^>]+>", " ", html)
+    for a, b in (
+        ("&nbsp;", " "),
+        ("&amp;", "&"),
+        ("&lt;", "<"),
+        ("&gt;", ">"),
+        ("&quot;", '"'),
+        ("&#39;", "'"),
+    ):
+        html = html.replace(a, b)
+    html = re.sub(r"\s+", " ", html).strip()
+    return html[:max_chars] if max_chars > 0 else html
+
+
+def direct_readpage(url: str, max_retry: int = 10) -> str:
+    """Fetch a URL with HTTP GET and strip HTML to plain text (no Jina / API key)."""
+    if not requests:
+        return "[browse] requests library not available."
+
+    u = (url or "").strip()
+    if not u:
+        return "[browse] Empty URL."
+
+    timeout = float(os.environ.get("BROWSE_DIRECT_TIMEOUT", "35"))
+    max_chars = int(os.environ.get("BROWSE_DIRECT_MAX_CHARS", "500000"))
+    headers = {
+        "User-Agent": _BROWSE_UA,
+        "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
+        "Accept-Language": "en-US,en;q=0.9",
+    }
+
+    for attempt in range(max_retry):
+        try:
+            response = requests.get(
+                u,
+                headers=headers,
+                timeout=timeout,
+                allow_redirects=True,
+                proxies=None,
+            )
+            if response.status_code == 429:
+                wait_time = 4 + random.uniform(2, 4)
+                print(
+                    f"[Browse] direct_readpage 429, retrying in {wait_time:.2f}s url={u!r}",
+                    flush=True,
+                )
+                time.sleep(wait_time)
+                continue
+            response.raise_for_status()
+            ct = (response.headers.get("Content-Type") or "").lower()
+            if "text/html" not in ct and "application/xhtml" not in ct and "text/plain" not in ct:
+                return f"[browse] Non-HTML response (Content-Type: {ct or 'unknown'})."
+            raw = response.content.decode(response.encoding or "utf-8", errors="replace")
+            text = _html_to_text(raw, max_chars=max_chars)
+            if not text.strip():
+                return "[browse] Empty page after stripping HTML."
+            return text
+        except Exception as e:
+            print(f"[Browse] direct_readpage attempt={attempt} url={u!r} error: {e}", flush=True)
+            if attempt == max_retry - 1:
+                return "[browse] Failed to read page."
+            time.sleep(0.5 + random.uniform(0, 1.0))
+
+    return "[browse] Failed to read page."
+
+
 def jina_readpage(url: str, max_retry: int = 10) -> str:
     """Fetch page content via the read-proxy."""
    if not requests:
```
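The regex pipeline in `_html_to_text` above is deliberately cheap. The same strip-to-text idea can also be done with the standard library's `HTMLParser`, which decodes character references for free; a hedged alternative sketch (class name hypothetical, not part of the diff):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript subtrees."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp;, &lt;, &#39;, ...
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


p = TextExtractor()
p.feed("<html><head><style>b{}</style></head>"
       "<body><p>A &amp; B</p><script>x=1</script></body></html>")
p.close()
print(p.text())  # A & B
```

A parser-based approach is more robust on malformed markup than chained regexes, at the cost of a little more code; the regex version in the diff keeps the vendored tool dependency-free and fast.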
vendor/rllm/vision_deepresearch_async_workflow/tools/gen_universal_image_search_impl.py
CHANGED

```diff
@@ -171,15 +171,73 @@ def download_image(
     return local_path
 
 
+def _resolve_image_search_backend() -> str:
+    explicit = (os.environ.get("WEB_IMAGE_SEARCH_PROVIDER") or "").strip().lower()
+    if explicit in ("duckduckgo", "ddg", "free"):
+        return "duckduckgo"
+    if explicit in ("serper", "google", "api"):
+        return "serper"
+    return "serper" if (os.environ.get("SERPER_KEY_ID") or "").strip() else "duckduckgo"
+
+
+def _fetch_ddg_image_results(query: str, topk: int, max_retry: int) -> List[dict]:
+    try:
+        from duckduckgo_search import DDGS
+    except ImportError as e:
+        raise RuntimeError("duckduckgo-search is not installed") from e
+
+    topk = min(max(1, topk), 20)
+    for retry in range(max_retry):
+        try:
+            items: List[dict] = []
+            with DDGS() as ddgs:
+                for row in ddgs.images(query, max_results=topk):
+                    image_url = row.get("image") or row.get("thumbnail") or ""
+                    page_url = row.get("url") or row.get("link") or ""
+                    title_txt = (row.get("title") or row.get("source") or "image").strip() or "image"
+                    if not image_url:
+                        continue
+                    items.append(
+                        {
+                            "title": title_txt,
+                            "imageUrl": image_url,
+                            "thumbnailUrl": row.get("thumbnail") or "",
+                            "link": page_url,
+                            "sourceUrl": "",
+                        }
+                    )
+            print(
+                f"[ImageSearch] DuckDuckGo results_len={len(items)} query={query!r}",
+                flush=True,
+            )
+            if items:
+                return items
+            sleep_time = random.uniform(1, 5)
+            print(f"[ImageSearch] DDG empty results, retry={retry} sleep={sleep_time:.2f}s", flush=True)
+            time.sleep(sleep_time)
+        except Exception as e:
+            print(f"[ImageSearch] _fetch_ddg_image_results retry={retry} error: {e}", flush=True)
+            if retry == max_retry - 1:
+                raise
+            time.sleep(1 + random.uniform(0, 2))
+
+    raise RuntimeError(f"DuckDuckGo image search failed after {max_retry} retries")
+
+
 def _fetch_universal_image_results(query: str, topk: int, max_retry: int) -> List[dict]:
     """
-    Fetch image search results via
+    Fetch image search results via Serper (Google) or DuckDuckGo (no API key),
     normalizing each hit to the schema expected by _download_from_items().
-    Uses SERPER_KEY_ID as X-API-KEY.
     """
+    if _resolve_image_search_backend() == "duckduckgo":
+        return _fetch_ddg_image_results(query, topk, max_retry)
+
     api_key = (os.environ.get("SERPER_KEY_ID") or "").strip()
     if not api_key:
-        raise ValueError(
+        raise ValueError(
+            "SERPER_KEY_ID is not set for image search "
+            "(set WEB_IMAGE_SEARCH_PROVIDER=duckduckgo for keyless image search)"
+        )
 
     url = (os.environ.get("IMAGE_SEARCH_API_BASE_URL") or "").strip()
     if not url:
```
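The key move in `_fetch_ddg_image_results` is mapping DuckDuckGo's row fields onto the Serper-style item schema that the downloader already understands. That normalization can be isolated into a pure function (the helper name is hypothetical; the field names are the ones used in the diff above):

```python
from typing import Optional


def normalize_ddg_image_row(row: dict) -> Optional[dict]:
    """Map one duckduckgo_search image hit onto the Serper-style item schema."""
    image_url = row.get("image") or row.get("thumbnail") or ""
    if not image_url:
        return None  # nothing downloadable in this hit
    return {
        "title": (row.get("title") or row.get("source") or "image").strip() or "image",
        "imageUrl": image_url,
        "thumbnailUrl": row.get("thumbnail") or "",
        "link": row.get("url") or row.get("link") or "",
        "sourceUrl": "",
    }


row = {
    "title": "Eiffel Tower",
    "image": "https://example.com/a.jpg",
    "thumbnail": "https://example.com/t.jpg",
    "url": "https://example.com/page",
}
print(normalize_ddg_image_row(row)["imageUrl"])  # https://example.com/a.jpg
```

Keeping the provider-specific shape behind a single normalizer is what lets `_download_from_items()` stay untouched when a second backend is added.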
vendor/rllm/vision_deepresearch_async_workflow/tools/gen_web_tools.py
CHANGED
|
@@ -16,7 +16,7 @@ import os
|
|
| 16 |
import random
|
| 17 |
import re
|
| 18 |
import time
|
| 19 |
-
from typing import List, Optional, Union
|
| 20 |
|
| 21 |
try:
|
| 22 |
import requests
|
|
@@ -30,14 +30,72 @@ def _clean_html_b(text: str) -> str:
|
|
| 30 |
return re.sub(r"</?b>", "", text or "")
|
| 31 |
|
| 32 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
def _text_search_sync(queries: List[str], topk: int = 10, max_retry: int = 100) -> str:
|
| 34 |
-
"""Blocking web search:
|
| 35 |
if requests is None:
|
| 36 |
return "[Search] requests is not installed."
|
| 37 |
|
|
|
|
|
|
|
|
|
|
| 38 |
api_key = (os.environ.get("SERPER_KEY_ID") or "").strip()
|
| 39 |
if not api_key:
|
| 40 |
-
return "[Search] SERPER_KEY_ID is not set."
|
| 41 |
|
| 42 |
url = (os.environ.get("TEXT_SEARCH_API_BASE_URL") or "").strip()
|
| 43 |
if not url:
|
|
@@ -190,33 +248,55 @@ def _image_search_sync(
|
|
| 190 |
def _browse_sync(
|
| 191 |
url: str,
|
| 192 |
query: str,
|
| 193 |
-
read_engine: str =
|
| 194 |
generate_engine: str = "deepseekchat",
|
| 195 |
max_retry: int = 10,
|
| 196 |
) -> str:
|
| 197 |
-
"""Fetch page via
|
| 198 |
# Optional random delay to spread traffic.
|
| 199 |
if os.environ.get("BROWSE_RANDOM_SLEEP", "").strip().lower() in ("1", "true", "yes"):
|
| 200 |
time.sleep(random.uniform(0, 16))
|
| 201 |
|
| 202 |
-
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
|
| 209 |
-
|
| 210 |
-
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
|
| 216 |
-
|
| 217 |
-
|
| 218 |
-
|
| 219 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 220 |
print(f"[Browse] Empty or failed read for url={url!r}", flush=True)
|
| 221 |
return "Browse error. Please try again."
|
| 222 |
|
|
@@ -289,7 +369,7 @@ def _browse_sync(
|
|
| 289 |
|
| 290 |
|
| 291 |
class WebTextSearchTool(DeepResearchTool):
|
| 292 |
-
"""Text search tool (Serper
|
| 293 |
|
| 294 |
def __init__(self):
|
| 295 |
super().__init__(
|
|
@@ -373,7 +453,6 @@ class JinaBrowseTool(DeepResearchTool):
|
|
| 373 |
"required": ["url", "goal"],
|
| 374 |
},
|
| 375 |
)
|
| 376 |
-
self._read_engine = "jina"
|
| 377 |
self._generate_engine = os.environ.get("BROWSE_GENERATE_ENGINE", "deepseekchat")
|
| 378 |
self._max_retry = 10
|
| 379 |
|
|
@@ -390,13 +469,14 @@ class JinaBrowseTool(DeepResearchTool):
|
|
| 390 |
# Gen-image agent passes "query"; other callers may use "goal"
|
| 391 |
effective_goal = (goal or query or kwargs.get("query") or "").strip()
|
| 392 |
goal = effective_goal or "Detailed summary of the page."
|
|
|
|
| 393 |
results: List[str] = []
|
| 394 |
for u in urls[:5]:
|
| 395 |
r = await self._run_blocking(
|
| 396 |
lambda uu=u: _browse_sync(
|
| 397 |
url=uu,
|
| 398 |
query=goal,
|
| 399 |
-
read_engine=
|
| 400 |
generate_engine=self._generate_engine,
|
| 401 |
max_retry=self._max_retry,
|
| 402 |
)
|
|
|
|
| 16 |
import random
|
| 17 |
import re
|
| 18 |
import time
|
| 19 |
+
from typing import List, Optional, Union, Literal
|
| 20 |
|
| 21 |
try:
|
| 22 |
import requests
|
|
|
|
| 30 |
return re.sub(r"</?b>", "", text or "")
|
| 31 |
|
| 32 |
|
| 33 |
+
+def _resolve_text_search_backend() -> Literal["serper", "duckduckgo"]:
+    """Serper when key is present (unless overridden); otherwise DuckDuckGo (no API key)."""
+    explicit = (os.environ.get("WEB_TEXT_SEARCH_PROVIDER") or "").strip().lower()
+    if explicit in ("duckduckgo", "ddg", "free"):
+        return "duckduckgo"
+    if explicit in ("serper", "google", "api"):
+        return "serper"
+    return "serper" if (os.environ.get("SERPER_KEY_ID") or "").strip() else "duckduckgo"
+
+
+def _resolve_browse_read_engine() -> Literal["jina", "direct"]:
+    """Jina when key is present (unless overridden); otherwise plain HTTP fetch."""
+    explicit = (os.environ.get("BROWSE_READ_ENGINE") or "").strip().lower()
+    if explicit in ("direct", "fetch", "http"):
+        return "direct"
+    if explicit == "jina":
+        return "jina"
+    if explicit:
+        return "jina"
+    return "jina" if (os.environ.get("JINA_API_KEYS") or "").strip() else "direct"
+
+
+def _text_search_duckduckgo(queries: List[str], topk: int = 10) -> str:
+    try:
+        from duckduckgo_search import DDGS
+    except ImportError:
+        return "[Search] duckduckgo-search is not installed."
+
+    topk = min(10, max(1, topk))
+    results: List[str] = []
+    for query in queries:
+        q_clean = (query or "").replace('"', "").replace("'", "").strip()
+        if not q_clean:
+            results.append("No results for empty query.")
+            continue
+        snippets: List[str] = []
+        try:
+            with DDGS() as ddgs:
+                for row in ddgs.text(q_clean, max_results=topk):
+                    title = _clean_html_b(row.get("title", "") or "")
+                    href = row.get("href", "") or ""
+                    body = _clean_html_b(row.get("body", "") or "")
+                    snippets.append(f"[{title}]({href}) {body}")
+        except Exception as e:
+            print(f"[Search] DuckDuckGo query={q_clean!r} error: {e}", flush=True)
+            results.append(f"Search failed for '{q_clean}': {e}")
+            continue
+        results.append("\n\n".join(snippets) if snippets else f"No results for '{q_clean}'.")
+
+    return "\n\n".join(
+        f"--- search result for [{q}] ---\n{r}\n--- end of search result ---"
+        for q, r in zip(queries, results)
+    )
+
+
 def _text_search_sync(queries: List[str], topk: int = 10, max_retry: int = 100) -> str:
+    """Blocking web search: Serper (Google API) or DuckDuckGo (no key)."""
     if requests is None:
         return "[Search] requests is not installed."

+    if _resolve_text_search_backend() == "duckduckgo":
+        return _text_search_duckduckgo(queries, topk=topk)
+
     api_key = (os.environ.get("SERPER_KEY_ID") or "").strip()
     if not api_key:
+        return "[Search] SERPER_KEY_ID is not set (or set WEB_TEXT_SEARCH_PROVIDER=duckduckgo for keyless search)."

     url = (os.environ.get("TEXT_SEARCH_API_BASE_URL") or "").strip()
     if not url:
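The backend choice in `_resolve_text_search_backend` is entirely environment-driven: an explicit `WEB_TEXT_SEARCH_PROVIDER` override wins, and otherwise the mere presence of `SERPER_KEY_ID` decides. A minimal standalone sketch of that precedence (re-implemented here for illustration; it takes the environment as a plain dict instead of reading `os.environ`, so it is easy to exercise without mutating process state):

```python
def resolve_backend(env: dict) -> str:
    """Mimics _resolve_text_search_backend: explicit override wins, else key presence decides."""
    explicit = (env.get("WEB_TEXT_SEARCH_PROVIDER") or "").strip().lower()
    if explicit in ("duckduckgo", "ddg", "free"):
        return "duckduckgo"
    if explicit in ("serper", "google", "api"):
        return "serper"
    # No explicit override: fall back to key presence.
    return "serper" if (env.get("SERPER_KEY_ID") or "").strip() else "duckduckgo"

print(resolve_backend({"SERPER_KEY_ID": "abc"}))                                     # serper
print(resolve_backend({"SERPER_KEY_ID": "abc", "WEB_TEXT_SEARCH_PROVIDER": "ddg"}))  # duckduckgo
print(resolve_backend({}))                                                           # duckduckgo
```

The practical upshot: a Space with no secrets configured silently degrades to keyless DuckDuckGo search instead of erroring out.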
 def _browse_sync(
     url: str,
     query: str,
+    read_engine: Optional[str] = None,
     generate_engine: str = "deepseekchat",
     max_retry: int = 10,
 ) -> str:
+    """Fetch page via Jina reader or direct HTTP, then summarize with an LLM."""
     # Optional random delay to spread traffic.
     if os.environ.get("BROWSE_RANDOM_SLEEP", "").strip().lower() in ("1", "true", "yes"):
         time.sleep(random.uniform(0, 16))

+    engine = ((read_engine or "").strip().lower() or _resolve_browse_read_engine())
+
+    if engine == "direct":
+        try:
+            from vision_deepresearch_async_workflow.tools.gen_jina_browse_impl import direct_readpage
+        except ImportError:
+            direct_readpage = None  # type: ignore
+        if direct_readpage is None:
+            return "[Browse] direct read backend is not available."
+        try:
+            source_text = direct_readpage(url, max_retry=max_retry)
+        except Exception as e:
+            print(f"[Browse] direct_readpage error url={url!r}: {e}", flush=True)
+            return "Browse error. Please try again."
+    elif engine == "jina":
+        try:
+            from vision_deepresearch_async_workflow.tools.gen_jina_browse_impl import jina_readpage
+        except ImportError:
+            jina_readpage = None  # type: ignore
+
+        if jina_readpage is None:
+            return "[Browse] browse backend is not available."
+
+        try:
+            source_text = jina_readpage(url, max_retry=max_retry)
+        except Exception as e:
+            print(f"[Browse] jina_readpage error url={url!r}: {e}", flush=True)
+            return "Browse error. Please try again."
+    else:
+        return f"[Browse] Unsupported read_engine={engine!r} (use jina or direct)."
+
+    _browse_err = (
+        "[browse] Failed",
+        "[browse] JINA_API_KEYS",
+        "[browse] requests library",
+        "[browse] Empty URL",
+        "[browse] Non-HTML",
+        "[browse] Empty page",
+    )
+    if not source_text.strip() or any(source_text.startswith(p) for p in _browse_err):
         print(f"[Browse] Empty or failed read for url={url!r}", flush=True)
         return "Browse error. Please try again."
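The `direct` path above delegates to `direct_readpage` in `gen_jina_browse_impl.py` (added in this commit), whose body is not shown in this hunk. A direct read is conceptually a plain HTTP GET followed by HTML-to-text extraction; the sketch below illustrates only the extraction half using the standard library, with illustrative names (`html_to_text`, `_TextExtractor`) that are not part of the repo's API:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script/style elements."""

    def __init__(self):
        super().__init__()
        self._skip = 0        # depth inside script/style tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)


print(html_to_text(
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
))  # prints three lines: Title, Hello, world
```

A production reader additionally needs retries, a content-type check (hence the `[browse] Non-HTML` sentinel), and truncation before the text is handed to the summarizing LLM.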
 class WebTextSearchTool(DeepResearchTool):
+    """Text search tool (Serper or DuckDuckGo via WEB_TEXT_SEARCH_PROVIDER / SERPER_KEY_ID)."""

     def __init__(self):
         super().__init__(
…
                 "required": ["url", "goal"],
             },
         )
         self._generate_engine = os.environ.get("BROWSE_GENERATE_ENGINE", "deepseekchat")
         self._max_retry = 10
…
         # Gen-image agent passes "query"; other callers may use "goal"
         effective_goal = (goal or query or kwargs.get("query") or "").strip()
         goal = effective_goal or "Detailed summary of the page."
+        read_engine = _resolve_browse_read_engine()
         results: List[str] = []
         for u in urls[:5]:
             r = await self._run_blocking(
                 lambda uu=u: _browse_sync(
                     url=uu,
                     query=goal,
+                    read_engine=read_engine,
                     generate_engine=self._generate_engine,
                     max_retry=self._max_retry,
                 )
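One detail worth calling out in the browse loop: `lambda uu=u: _browse_sync(url=uu, ...)` uses a default argument to snapshot the current URL. A plain `lambda: _browse_sync(url=u, ...)` would close over the loop variable itself, so every deferred call scheduled by `_run_blocking` would see the last URL in the batch. A self-contained demonstration of the difference (with a hypothetical stand-in for `_run_blocking`, whose real implementation is not shown in this diff):

```python
import asyncio


async def run_blocking(fn):
    # Stand-in for the tool's _run_blocking helper: run a
    # zero-argument callable in the default thread-pool executor.
    return await asyncio.get_running_loop().run_in_executor(None, fn)


async def main():
    urls = ["a", "b", "c"]
    late, bound = [], []
    for u in urls:
        late.append(lambda: u)         # closes over the variable u itself
        bound.append(lambda uu=u: uu)  # default argument snapshots u's current value
    late_vals = [f() for f in late]
    bound_vals = [await run_blocking(f) for f in bound]
    return late_vals, bound_vals


late_vals, bound_vals = asyncio.run(main())
print(late_vals)   # ['c', 'c', 'c'] -- every lambda sees the final loop value
print(bound_vals)  # ['a', 'b', 'c'] -- each lambda kept its own snapshot
```

This is why the tool binds `uu=u` per iteration even though the calls are awaited one at a time: the callable is built eagerly but executed later on another thread.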