web-fetch-shared
Generated from plugins/web-fetch-shared/README.md.
Shared HTTP fetch and HTML extraction helpers for vendor web fetch tools.
This is a library package, not a plugin package. It provides reusable components for implementing vendor-specific web fetch tools (Gemini, Qwen, Kimi, etc.).
Installation
pip install -e .
Dependencies
aiohttp>=3.8.0 - Async HTTP client
trafilatura>=1.6.0 - HTML-to-text extraction
Usage
Fetch URL
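import asyncio
from web_fetch_shared import fetch_url, FetchResult
# fetch_url is a coroutine, so it has to be awaited inside an event loop
async def main() -> None:
    result: FetchResult = await fetch_url("https://example.com")
    if not result.is_error:
        print(result.content)
asyncio.run(main())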
Extract Text from HTML
from web_fetch_shared import extract_text
html = "<html><body><p>Hello world</p></body></html>"
text = extract_text(html, "text/html")
print(text) # "Hello world"
URL Helpers
from web_fetch_shared import normalize_github_url, is_private_url, parse_urls_from_text
# Convert GitHub blob URL to raw content URL
raw_url = normalize_github_url("https://github.com/user/repo/blob/main/README.md")
# Returns: https://raw.githubusercontent.com/user/repo/main/README.md
# Check if URL points to private IP
is_private = is_private_url("http://192.168.1.1/internal")
# Returns: True
# Parse URLs from text
valid, errors = parse_urls_from_text("Check https://example.com and http://[invalid")
Content Utilities
from web_fetch_shared import truncate_content, format_gemini_citations
# Truncate long content
truncated = truncate_content(long_text, max_chars=1000)
# Format Gemini grounding citations
formatted = format_gemini_citations(text, grounding_chunks, grounding_supports)
API Reference
FetchResult
Dataclass representing a URL fetch result.
| Field | Type | Description |
|---|---|---|
| content | str | Fetched content |
| content_type | str | Content-Type header |
| status_code | int | HTTP status code |
| headers | Dict[str, str] | Response headers |
| url | str | Final URL after redirects |
| is_error | bool | Whether this is an error |
| error_message | Optional[str] | Error message if is_error |
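The field table above maps naturally onto a dataclass. The sketch below is reconstructed from the table alone: the name `FetchResultSketch`, the default values, and the `describe` helper are assumptions for illustration, not the package's actual definition.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FetchResultSketch:
    """Stand-in mirroring the documented fields; the defaults are assumed."""

    content: str
    content_type: str
    status_code: int
    headers: Dict[str, str] = field(default_factory=dict)
    url: str = ""
    is_error: bool = False
    error_message: Optional[str] = None


def describe(result: FetchResultSketch) -> str:
    """Hypothetical helper: summarize a result in one line."""
    if result.is_error:
        return f"error fetching {result.url}: {result.error_message}"
    return f"{result.status_code} {result.content_type} ({len(result.content)} chars)"


ok = FetchResultSketch("hello", "text/plain", 200, url="https://example.com")
failed = FetchResultSketch("", "", 0, url="https://example.com",
                           is_error=True, error_message="timeout")
print(describe(ok))
print(describe(failed))
```

Checking `is_error` before touching `content` keeps callers from treating an empty error body as a successful empty page.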
fetch_url()
async def fetch_url(
    url: str,
    *,
    timeout_seconds: float = 10.0,
    headers: Optional[Dict[str, str]] = None,
    max_size_bytes: int = 10 * 1024 * 1024,
    max_redirects: int = 5,
    user_agent: str = "Mozilla/5.0 (compatible; AI-Agent-Platform/1.0)",
) -> FetchResult
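To illustrate what a `max_size_bytes` limit can mean in practice, here is a minimal sketch of a size-capped streaming read. It does not show how `fetch_url` enforces its limit, which this README does not specify; `read_capped`, `ResponseTooLargeError`, and `fake_body` are hypothetical names, and the fake body stands in for a real streamed response.

```python
import asyncio
from typing import AsyncIterator


class ResponseTooLargeError(Exception):
    """Hypothetical error type for bodies that exceed the cap."""


async def read_capped(chunks: AsyncIterator[bytes], max_size_bytes: int) -> bytes:
    """Accumulate chunks, failing as soon as the running total passes the cap."""
    body = bytearray()
    async for chunk in chunks:
        body.extend(chunk)
        if len(body) > max_size_bytes:
            raise ResponseTooLargeError(
                f"response body exceeds {max_size_bytes} bytes"
            )
    return bytes(body)


async def fake_body(n_chunks: int, chunk_size: int) -> AsyncIterator[bytes]:
    """Stand-in for a streamed HTTP body."""
    for _ in range(n_chunks):
        yield b"x" * chunk_size


# 3 chunks of 4 bytes fit comfortably under a 100-byte cap.
print(len(asyncio.run(read_capped(fake_body(3, 4), max_size_bytes=100))))
```

Reading in chunks means an oversized response is rejected without buffering all of it first. With aiohttp, the chunk source could be `response.content.iter_chunked(...)`.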
extract_text()
def extract_text(
    html_or_text: str,
    content_type: str,
    *,
    include_comments: bool = True,
    include_tables: bool = True,
    include_formatting: bool = False,
    output_format: str = "txt",
) -> str
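The `content_type` argument suggests that extraction depends on the media type. The sketch below shows that dispatch idea with a standard-library stand-in: it uses `html.parser` rather than trafilatura, so its output will differ from the real `extract_text`, and passing non-HTML content through unchanged is an assumption. `extract_text_sketch` is a hypothetical name.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text_sketch(html_or_text: str, content_type: str) -> str:
    """Return plain text for HTML input; pass other content through unchanged."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ("text/html", "application/xhtml+xml"):
        return html_or_text
    collector = _TextCollector()
    collector.feed(html_or_text)
    collector.close()
    return "\n".join(collector.parts)


print(extract_text_sketch("<html><body><p>Hello world</p></body></html>", "text/html"))
```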
normalize_github_url()
def normalize_github_url(url: str) -> str
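One way the blob-to-raw conversion shown in the usage section could work is sketched below. This is an illustration under assumptions, not the package's code: `normalize_github_url_sketch` is a hypothetical name, and returning non-matching URLs unchanged is assumed behavior.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_github_url_sketch(url: str) -> str:
    """Rewrite github.com blob URLs to raw.githubusercontent.com; leave others alone."""
    parts = urlsplit(url)
    if parts.hostname != "github.com":
        return url
    segments = parts.path.strip("/").split("/")
    # Expected shape: <owner>/<repo>/blob/<ref>/<path...>
    if len(segments) < 5 or segments[2] != "blob":
        return url
    owner, repo, _blob, *rest = segments
    raw_path = "/" + "/".join([owner, repo, *rest])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, parts.query, ""))


print(normalize_github_url_sketch("https://github.com/user/repo/blob/main/README.md"))
```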
is_private_url()
def is_private_url(url: str) -> bool
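A private-address check of this kind can be approximated with the standard `ipaddress` module. The sketch below only inspects literal IP hosts and `localhost`; it does not resolve DNS names, and whether the real `is_private_url` does so is not stated here. `is_private_url_sketch` is a hypothetical name.

```python
import ipaddress
from urllib.parse import urlsplit


def is_private_url_sketch(url: str) -> bool:
    """True when the URL's host is a literal private, loopback, or link-local address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: this sketch does not resolve it.
        return False
    return address.is_private or address.is_loopback or address.is_link_local


print(is_private_url_sketch("http://192.168.1.1/internal"))
```

A check like this is typically run before fetching, so that a tool cannot be steered toward internal services.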
truncate_content()
def truncate_content(
    content: str,
    max_chars: int = 100000,
    truncation_marker: str = "... [Content truncated due to size limit]",
) -> str
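The signature above leaves open whether the marker counts toward `max_chars`. The sketch below assumes it does not: the first `max_chars` characters are kept and the marker is appended, so the result can be longer than the limit. `truncate_content_sketch` is a hypothetical name.

```python
def truncate_content_sketch(
    content: str,
    max_chars: int = 100000,
    truncation_marker: str = "... [Content truncated due to size limit]",
) -> str:
    """Keep at most max_chars characters, appending the marker only when cut."""
    if len(content) <= max_chars:
        return content
    # Assumption: the marker is added on top of the max_chars kept characters.
    return content[:max_chars] + truncation_marker


print(truncate_content_sketch("abcdef", max_chars=3, truncation_marker="..."))
```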
format_gemini_citations()
def format_gemini_citations(
    text: str,
    grounding_chunks: list[dict[str, Any]],
    grounding_supports: list[dict[str, Any]],
) -> str
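As a rough illustration of citation formatting, the sketch below inserts numbered markers at the end of each supported segment and appends a source list. The dictionary shapes are assumptions: each chunk is taken to look like `{"web": {"uri": ..., "title": ...}}` and each support like `{"segment": {"end_index": ...}, "grounding_chunk_indices": [...]}`, with offsets treated as character positions. Real Gemini responses may use different key casing or byte offsets, and the output format of the real `format_gemini_citations` may differ.

```python
from typing import Any


def format_citations_sketch(
    text: str,
    grounding_chunks: list[dict[str, Any]],
    grounding_supports: list[dict[str, Any]],
) -> str:
    """Insert [n] markers after supported segments and append a source list."""
    # Work from the end of the text backwards so earlier offsets stay valid.
    ordered = sorted(
        grounding_supports,
        key=lambda support: support["segment"]["end_index"],
        reverse=True,
    )
    for support in ordered:
        end = support["segment"]["end_index"]
        markers = "".join(f"[{i + 1}]" for i in support["grounding_chunk_indices"])
        text = text[:end] + markers + text[end:]
    sources = [
        f"[{n}] {chunk['web']['title']} - {chunk['web']['uri']}"
        for n, chunk in enumerate(grounding_chunks, start=1)
    ]
    if sources:
        text += "\n\nSources:\n" + "\n".join(sources)
    return text


sample_text = "Paris is in France. It has a tower."
sample_chunks = [
    {"web": {"uri": "https://a.example", "title": "A"}},
    {"web": {"uri": "https://b.example", "title": "B"}},
]
sample_supports = [
    {"segment": {"end_index": sample_text.index(".") + 1}, "grounding_chunk_indices": [0]},
    {"segment": {"end_index": len(sample_text)}, "grounding_chunk_indices": [0, 1]},
]
print(format_citations_sketch(sample_text, sample_chunks, sample_supports))
```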
parse_urls_from_text()
def parse_urls_from_text(text: str) -> tuple[list[str], list[str]]
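The return type is a pair of lists, which the usage example unpacks as `valid, errors`. The sketch below shows one plausible way to produce such a pair with the standard library. What the second list actually contains in the real function (messages or the offending strings) is not documented here, so the error format and the name `parse_urls_sketch` are assumptions.

```python
import re
from urllib.parse import urlsplit

# Anything that starts like an http(s) URL and runs to the next whitespace.
_URL_CANDIDATE = re.compile(r"https?://\S+")


def parse_urls_sketch(text: str) -> tuple[list[str], list[str]]:
    """Split URL-like tokens in text into parseable URLs and error descriptions."""
    valid: list[str] = []
    errors: list[str] = []
    for candidate in _URL_CANDIDATE.findall(text):
        # Trim sentence punctuation that commonly trails a URL in prose.
        candidate = candidate.rstrip(".,;:!?)")
        try:
            parts = urlsplit(candidate)
        except ValueError as exc:
            errors.append(f"{candidate}: {exc}")
            continue
        if parts.hostname:
            valid.append(candidate)
        else:
            errors.append(f"{candidate}: missing host")
    return valid, errors


valid, errors = parse_urls_sketch("Check https://example.com and http://[invalid")
print(valid)
print(len(errors))
```

Here `http://[invalid` is rejected because `urlsplit` raises `ValueError` on an unterminated IPv6 bracket.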
Constants
| Constant | Value | Description |
|---|---|---|
| DEFAULT_TIMEOUT_SECONDS | 10.0 | Default fetch timeout |
| DEFAULT_MAX_SIZE_BYTES | 10 * 1024 * 1024 | Default max response size (10MB) |
| DEFAULT_MAX_CHARS | 100000 | Default truncation limit |
| TRUNCATION_MARKER | "... [Content truncated due to size limit]" | Default truncation marker |
License
Copyright 2026 Dynamic Programming Solutions Kft.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.