scitex_web

scitex-web — web scraping + PubMed search + URL summarization (standalone).

scitex_web.search_pubmed(query, n_entries=10, *, search_fn=None, fetch_fn=None)

Return type: int

scitex_web.get_crossref_metrics(doi, api_key=None, email=None, *, base_url='https://api.crossref.org/works/')

Get article metrics from CrossRef using DOI.

Return type: Dict[str, Any]
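
A minimal sketch of how a request for a DOI could be assembled from the documented parameters. It is not the library's implementation: `build_crossref_request` is a hypothetical helper, the DOI is a placeholder, and sending `email` as the `mailto` query parameter (the convention for the CrossRef "polite pool") is an assumption about how the argument is used. The source does not say how `api_key` is transmitted, so the sketch leaves it out.

```python
from urllib.parse import quote


def build_crossref_request(doi, email=None, *, base_url="https://api.crossref.org/works/"):
    """Hypothetical helper: return the (url, params) pair for a CrossRef lookup."""
    # Percent-encode unusual characters in the DOI but keep "/" intact.
    url = base_url + quote(doi, safe="/")
    # Assumption: the email is passed as CrossRef's `mailto` query parameter.
    params = {"mailto": email} if email else {}
    return url, params


url, params = build_crossref_request("10.1000/xyz123", email="me@example.org")
print(url)
print(params)
```

Keeping `base_url` keyword-only, as in the documented signature, makes it easy to point the same code at a local stub server.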

scitex_web.summarize_url(start_url, *, crawl_fn=None, summarize_fn=None)

Crawl start_url then summarize it.

crawl_fn defaults to crawl_to_json(); summarize_fn defaults to summarize_all(). Both are injectable so the composition can be exercised with deterministic stand-ins.
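
The injection pattern described above can be illustrated with a self-contained sketch. `summarize_url_sketch` is a hypothetical stand-in that only mirrors the documented signature. The data passed between the two callables (a JSON string mapping URLs to page text) is an assumption, since the source does not specify it.

```python
import json


def summarize_url_sketch(start_url, *, crawl_fn=None, summarize_fn=None):
    """Hypothetical stand-in: crawl first, then summarize the crawl result."""
    if crawl_fn is None or summarize_fn is None:
        # The real function falls back to crawl_to_json() / summarize_all();
        # this sketch has no such defaults.
        raise ValueError("pass both crawl_fn and summarize_fn to the sketch")
    crawled = crawl_fn(start_url)
    return summarize_fn(crawled)


def fake_crawl(start_url):
    # Deterministic stand-in for a crawler: no network access.
    pages = {start_url: "page text", start_url + "/about": "more text"}
    return json.dumps(pages)


def fake_summarize(crawled_json):
    # Deterministic stand-in for a summarizer: upper-case each page.
    pages = json.loads(crawled_json)
    return {url: text.upper() for url, text in pages.items()}


result = summarize_url_sketch(
    "https://example.com", crawl_fn=fake_crawl, summarize_fn=fake_summarize
)
print(result)
```

Because both steps are plain callables, a test can assert on the exact output without touching the network or a language model.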

scitex_web.crawl_url(url, max_depth=1)

scitex_web.crawl_to_json(start_url, *, crawler=None, genai_factory=None)

Crawl start_url and summarize each page into a JSON document.

crawler defaults to crawl_url(); genai_factory defaults to _get_genai(). Both are injectable so callers can supply a real local crawler / a deterministic summarizer without monkey-patching.

scitex_web.get_urls(url, pattern=None, absolute=True, same_domain=False, include_external=True, *, http_get=None)

Extract all URLs from a webpage.

Parameters:

url (str) – The URL of the webpage to scrape
pattern (Optional[str]) – Optional regex pattern to filter URLs (e.g., r'\.pdf$' for PDF files)
absolute (bool) – If True, convert relative URLs to absolute URLs
same_domain (bool) – If True, only return URLs from the same domain
include_external (bool) – If True, include external links (only applies if same_domain=False)
http_get – Injected HTTP GET callable matching requests.get(url, *, timeout, headers). Defaults to requests.get(). Tests pass a hand-rolled fake; production code never sets this.

Return type:

List[str]

Returns:

List of URLs found on the page

Example

>>> urls = get_urls('https://example.com', pattern=r'\.pdf$')
>>> urls = get_urls('https://example.com', same_domain=True)
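
The following sketch shows one way a hand-rolled `http_get` fake and the documented filters could fit together. It is not the library's code: `get_urls_sketch`, `FakeResponse`, and `fake_http_get` are hypothetical names, the response attributes used (`text`, `raise_for_status()`) are assumptions modeled on `requests`, and `include_external` is omitted for brevity.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


class FakeResponse:
    """Hypothetical stand-in for a requests.Response."""

    def __init__(self, text):
        self.text = text
        self.status_code = 200

    def raise_for_status(self):
        return None


def fake_http_get(url, *, timeout=None, headers=None):
    # Same call shape as requests.get(url, *, timeout, headers); canned HTML.
    return FakeResponse(
        '<a href="/paper.pdf">PDF</a>'
        '<a href="https://other.org/x">external</a>'
        '<a href="about.html">About</a>'
    )


def get_urls_sketch(url, pattern=None, absolute=True, same_domain=False, *, http_get):
    response = http_get(url, timeout=10, headers={"User-Agent": "sketch"})
    response.raise_for_status()
    collector = HrefCollector()
    collector.feed(response.text)
    urls = []
    for href in collector.hrefs:
        resolved = urljoin(url, href)  # resolve relative links against the page
        if same_domain and urlparse(resolved).netloc != urlparse(url).netloc:
            continue
        candidate = resolved if absolute else href
        if pattern and not re.search(pattern, candidate):
            continue
        urls.append(candidate)
    return urls


pdfs = get_urls_sketch("https://example.com/docs/", pattern=r"\.pdf$", http_get=fake_http_get)
local = get_urls_sketch("https://example.com/docs/", same_domain=True, http_get=fake_http_get)
print(pdfs)
print(local)
```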

scitex_web.download_images(url, output_dir=None, min_size=None, max_workers=5, same_domain=False)

Download images from a URL.

Parameters:

url (str) – Webpage URL or direct image URL
output_dir (Optional[str]) – Output directory (default: $SCITEX_DIR/web/downloads)
min_size (Optional[Tuple[int, int]]) – Minimum (width, height) to filter small images (default: 400x300)
max_workers (int) – Concurrent download threads
same_domain (bool) – Only download images from the same domain

Return type:

List[str]

Returns:

List of downloaded file paths

Example

>>> paths = download_images("https://example.com")
>>> paths = download_images("https://example.com/photo.jpg")
>>> paths = download_images("https://example.com", min_size=(800, 600))
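
As an illustration of how `max_workers` and `min_size` might interact, here is a sketch that fetches concurrently and then filters by size. It rests on assumptions: the helper names and the fake size table are hypothetical, and requiring both width and height to meet the minimum is a guess at the filter's semantics, not documented behavior.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical image sizes standing in for real downloads.
FAKE_SIZES = {
    "https://example.com/hero.jpg": (1200, 800),
    "https://example.com/icon.png": (32, 32),
    "https://example.com/wide.jpg": (900, 200),
}


def fake_fetch(url):
    # Stand-in for a download: returns the URL and its (width, height).
    return url, FAKE_SIZES[url]


def meets_min_size(size, min_size):
    # Assumption: both dimensions must reach the minimum.
    width, height = size
    min_width, min_height = min_size
    return width >= min_width and height >= min_height


def fetch_large_images(urls, fetch, min_size=(400, 300), max_workers=5):
    # pool.map runs fetch concurrently and yields results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch, urls))
    return [url for url, size in results if meets_min_size(size, min_size)]


kept = fetch_large_images(list(FAKE_SIZES), fake_fetch)
print(kept)
```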

scitex_web.get_image_urls(url, pattern=None, same_domain=False, *, http_get=None)

Extract all image URLs from a webpage without downloading them.

Parameters:

url (str) – The URL of the webpage to scrape
pattern (Optional[str]) – Optional regex pattern to filter image URLs
same_domain (bool) – If True, only return images from the same domain
http_get – Injected HTTP GET callable matching requests.get(url, *, timeout, headers). Defaults to requests.get(). Tests pass a hand-rolled fake; production code never sets this.

Return type:

List[str]

Returns:

List of image URLs found on the page

Note

SVG files are automatically skipped (vector graphics).
Checks both 'src' and 'data-src' attributes for lazy-loaded images.

Example

>>> img_urls = get_image_urls('https://example.com')
>>> img_urls = get_image_urls('https://example.com', pattern=r'\.png$')
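
The behavior in the Note (skip SVGs, read both `src` and `data-src`) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `image_urls_from_html` and `ImgSourceCollector` are hypothetical names, and collecting both attributes when an `<img>` carries both is an assumption.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImgSourceCollector(HTMLParser):
    """Collect 'src' and 'data-src' values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name in ("src", "data-src") and value:
                self.sources.append(value)


def image_urls_from_html(html, base_url, pattern=None):
    collector = ImgSourceCollector()
    collector.feed(html)
    found = []
    for source in collector.sources:
        absolute = urljoin(base_url, source)
        # Skip vector graphics, judged by the URL path's extension.
        if urlparse(absolute).path.lower().endswith(".svg"):
            continue
        if pattern and not re.search(pattern, absolute):
            continue
        if absolute not in found:  # de-duplicate, preserving order
            found.append(absolute)
    return found


SAMPLE_HTML = """
<img src="/static/logo.svg">
<img src="placeholder.png" data-src="/photos/full.jpg">
<img data-src="https://cdn.example.net/banner.png">
"""

all_images = image_urls_from_html(SAMPLE_HTML, "https://example.com/gallery/")
png_only = image_urls_from_html(SAMPLE_HTML, "https://example.com/gallery/", pattern=r"\.png$")
print(all_images)
print(png_only)
```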

Modules

download_images(url[, output_dir, min_size, ...])

Download images from a URL.