scitex_web
scitex-web — web scraping + PubMed search + URL summarization (standalone).
- scitex_web.search_pubmed(query, n_entries=10, *, search_fn=None, fetch_fn=None)[source]
- Return type:
- scitex_web.get_crossref_metrics(doi, api_key=None, email=None, *, base_url='https://api.crossref.org/works/')[source]
Get article metrics from CrossRef using DOI.
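As a rough illustration of the kind of lookup involved, the sketch below builds a CrossRef works URL from a DOI and reads a citation count out of a decoded response. The helper names (build_works_url, citation_count), the example DOI, and the sample payload are hypothetical; which fields get_crossref_metrics actually returns is not stated in this section.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.crossref.org/works/"  # default base_url from the signature above


def build_works_url(doi, email=None, base_url=BASE_URL):
    """Hypothetical helper: compose the request URL for one DOI."""
    url = base_url + quote(doi, safe="/")
    if email:
        # CrossRef convention: identify the caller with a mailto parameter.
        url += "?" + urlencode({"mailto": email})
    return url


def citation_count(payload):
    """Hypothetical helper: read the citation count from a works response."""
    return payload.get("message", {}).get("is-referenced-by-count")


# Hand-written stand-in for a decoded JSON response (not real data).
sample = {"status": "ok", "message": {"DOI": "10.1000/example", "is-referenced-by-count": 7}}

print(build_works_url("10.1000/example", email="me@example.org"))
print(citation_count(sample))
```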
- scitex_web.summarize_url(start_url, *, crawl_fn=None, summarize_fn=None)[source]
Crawl start_url then summarize it. crawl_fn defaults to crawl_to_json(); summarize_fn defaults to summarize_all(). Both are injectable so the composition can be exercised with deterministic stand-ins.
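The injectable composition described above might look roughly like the following. This is a standalone sketch, not the library's source: summarize_url_sketch, fake_crawl, and fake_summarize are hypothetical names, and the data passed between the two callables is an assumption, since the real return types are not documented here.

```python
def summarize_url_sketch(start_url, *, crawl_fn, summarize_fn):
    """Illustrative composition: crawl first, then summarize the crawl result."""
    crawled = crawl_fn(start_url)
    return summarize_fn(crawled)


# Deterministic stand-ins (assumed shapes, chosen only for this example).
def fake_crawl(start_url):
    return {start_url: "page text"}


def fake_summarize(crawled):
    return f"{len(crawled)} page(s) crawled"


result = summarize_url_sketch(
    "https://example.com", crawl_fn=fake_crawl, summarize_fn=fake_summarize
)
print(result)
```

Because both collaborators arrive as keyword arguments, a test can check the wiring without touching the network or a language model.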
- scitex_web.crawl_to_json(start_url, *, crawler=None, genai_factory=None)[source]
Crawl start_url and summarize each page into a JSON document. crawler defaults to crawl_url(); genai_factory defaults to _get_genai(). Both are injectable so callers can supply a real local crawler or a deterministic summarizer without monkey-patching.
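A minimal sketch of the crawl-then-summarize-to-JSON idea, under stated assumptions: the crawler is taken to return a mapping of URL to page text, and the factory is taken to return a callable from text to summary. Neither shape is confirmed by this section, and crawl_to_json_sketch and the two fakes are hypothetical names.

```python
import json


def crawl_to_json_sketch(start_url, *, crawler, genai_factory):
    """Illustrative only: crawl, summarize each page, serialize to JSON."""
    pages = crawler(start_url)    # assumed: mapping of URL -> page text
    summarizer = genai_factory()  # assumed: factory returns callable text -> summary
    records = [{"url": url, "summary": summarizer(text)} for url, text in pages.items()]
    return json.dumps({"start_url": start_url, "pages": records}, indent=2)


def fake_crawler(start_url):
    return {
        start_url: "Hello world. More text.",
        start_url + "/about": "About us. Details.",
    }


def fake_genai_factory():
    # Deterministic "summarizer": keep the first sentence only.
    return lambda text: text.split(".")[0] + "."


doc = json.loads(
    crawl_to_json_sketch(
        "https://example.com", crawler=fake_crawler, genai_factory=fake_genai_factory
    )
)
print(doc["pages"][0]["summary"])
```

Passing a factory rather than a ready-made summarizer lets the expensive client be constructed only when a crawl actually runs.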
- scitex_web.get_urls(url, pattern=None, absolute=True, same_domain=False, include_external=True, *, http_get=None)[source]
Extract all URLs from a webpage.
- Parameters:
  - url (str) – The URL of the webpage to scrape
  - pattern (Optional[str]) – Optional regex pattern to filter URLs (e.g., r'\.pdf$' for PDF files)
  - absolute (bool) – If True, convert relative URLs to absolute URLs
  - same_domain (bool) – If True, only return URLs from the same domain
  - include_external (bool) – If True, include external links (only applies if same_domain=False)
  - http_get – Injected HTTP GET callable matching requests.get(url, *, timeout, headers). Defaults to requests.get(). Tests pass a hand-rolled fake; production code never sets this.
- Return type:
- Returns:
List of URLs found on the page
Example
>>> urls = get_urls('https://example.com', pattern=r'\.pdf$')
>>> urls = get_urls('https://example.com', same_domain=True)
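To make the http_get injection and the filtering options concrete, here is a standalone sketch that uses only the standard library. It is not the implementation of get_urls: extract_urls_sketch and fake_http_get are hypothetical, the response is assumed to expose a .text attribute as requests.Response does, and include_external is omitted for brevity.

```python
import re
from html.parser import HTMLParser
from types import SimpleNamespace
from urllib.parse import urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_urls_sketch(url, pattern=None, absolute=True, same_domain=False, *, http_get):
    response = http_get(url, timeout=10, headers={"User-Agent": "sketch"})
    collector = _HrefCollector()
    collector.feed(response.text)  # assumed attribute, as on requests.Response
    urls = []
    for href in collector.hrefs:
        resolved = urljoin(url, href)
        if same_domain and urlparse(resolved).netloc != urlparse(url).netloc:
            continue  # drop links that point at another host
        candidate = resolved if absolute else href
        if pattern and not re.search(pattern, candidate):
            continue  # drop links that do not match the regex filter
        urls.append(candidate)
    return urls


def fake_http_get(url, *, timeout, headers):
    # Hand-rolled fake: returns fixed HTML instead of touching the network.
    html = (
        '<a href="/paper.pdf">PDF</a>'
        '<a href="about.html">About</a>'
        '<a href="https://other.org/x.pdf">Ext</a>'
    )
    return SimpleNamespace(text=html)


pdfs = extract_urls_sketch("https://example.com/", pattern=r"\.pdf$", http_get=fake_http_get)
local = extract_urls_sketch("https://example.com/", same_domain=True, http_get=fake_http_get)
print(pdfs)
print(local)
```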
- scitex_web.download_images(url, output_dir=None, min_size=None, max_workers=5, same_domain=False)[source]
Download images from a URL.
- Parameters:
  - url (str) – Webpage URL or direct image URL
  - output_dir (Optional[str]) – Output directory (default: $SCITEX_DIR/web/downloads)
  - min_size (Optional[Tuple[int, int]]) – Minimum (width, height) to filter small images (default: 400x300)
  - max_workers (int) – Concurrent download threads
  - same_domain (bool) – Only download images from the same domain
- Return type:
- Returns:
List of downloaded file paths
Example
>>> paths = download_images("https://example.com")
>>> paths = download_images("https://example.com/photo.jpg")
>>> paths = download_images("https://example.com", min_size=(800, 600))
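The interaction between min_size and max_workers can be pictured with a small thread-pool sketch. It rests on assumptions not confirmed by this section: that an image is kept only when both its width and height meet the minimum, and that sizes come from a hypothetical fetch_size callable (here a dictionary lookup rather than a real download).

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_MIN_SIZE = (400, 300)  # documented default for min_size


def is_large_enough(size, min_size=DEFAULT_MIN_SIZE):
    """Assumed rule: both dimensions must meet the minimum."""
    width, height = size
    min_width, min_height = min_size
    return width >= min_width and height >= min_height


def filter_images_sketch(image_urls, fetch_size, min_size=None, max_workers=5):
    """Probe each URL concurrently and keep those passing the size filter.

    fetch_size is a hypothetical callable: URL -> (width, height).
    """
    min_size = min_size or DEFAULT_MIN_SIZE
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        sizes = list(pool.map(fetch_size, image_urls))  # map preserves input order
    return [url for url, size in zip(image_urls, sizes) if is_large_enough(size, min_size)]


# Made-up sizes standing in for real image probes.
FAKE_SIZES = {"a.jpg": (1024, 768), "icon.png": (32, 32), "b.jpg": (640, 480)}

kept_default = filter_images_sketch(list(FAKE_SIZES), FAKE_SIZES.get)
kept_large = filter_images_sketch(list(FAKE_SIZES), FAKE_SIZES.get, min_size=(800, 600))
print(kept_default)
print(kept_large)
```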
- scitex_web.get_image_urls(url, pattern=None, same_domain=False, *, http_get=None)[source]
Extract all image URLs from a webpage without downloading them.
- Parameters:
  - url (str) – The URL of the webpage to scrape
  - pattern (Optional[str]) – Optional regex pattern to filter image URLs
  - same_domain (bool) – If True, only return images from the same domain
  - http_get – Injected HTTP GET callable matching requests.get(url, *, timeout, headers). Defaults to requests.get(). Tests pass a hand-rolled fake; production code never sets this.
- Return type:
- Returns:
List of image URLs found on the page
Note
SVG files are automatically skipped (vector graphics)
Checks both ‘src’ and ‘data-src’ attributes for lazy-loaded images
Example
>>> img_urls = get_image_urls('https://example.com')
>>> img_urls = get_image_urls('https://example.com', pattern=r'\.png$')
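The two behaviors in the note (skipping SVG files and reading data-src for lazy-loaded images) can be sketched with the standard-library HTML parser. This is an illustration rather than the library's code: image_urls_sketch is a hypothetical name, and preferring src over data-src when both are present is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _ImgCollector(HTMLParser):
    """Collect one source per <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Use src when present, else data-src for lazy-loaded images (assumed precedence).
        source = attr_map.get("src") or attr_map.get("data-src")
        if source:
            self.sources.append(source)


def image_urls_sketch(page_url, html):
    collector = _ImgCollector()
    collector.feed(html)
    urls = []
    for source in collector.sources:
        absolute = urljoin(page_url, source)
        if urlparse(absolute).path.lower().endswith(".svg"):
            continue  # vector graphics are skipped
        urls.append(absolute)
    return urls


page = '<img src="/a.png"><img data-src="lazy/b.jpg"><img src="logo.svg">'
found = image_urls_sketch("https://example.com/", page)
print(found)
```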
Modules
Download images from a URL.