For some reason, I can use the typesense/docsearch-scraper container to scrape my locally hosted Docusaurus site, but it does not seem to 'see' a different locally hosted Docusaurus site. Has anyone else encountered a similar issue?
The env file has the following content:
TYPESENSE_API_KEY=xxxxxxxxxxx
TYPESENSE_HOST=doc-api.wyden.io
TYPESENSE_PORT=443
TYPESENSE_PROTOCOL=https
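(As far as I can tell, the Typesense server behind these settings is reachable, since the import call in the log further down goes through. If it helps, the settings can be double-checked against the standard Typesense health endpoint, for example:

curl https://doc-api.wyden.io:443/health
# a healthy server responds with {"ok":true}
)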
The config JSON file has the following content:
{
  "index_name": "Documentation",
  "start_urls": [
    {
      "url": "http://doc-dev.wyden.io/docs/book/"
    }
  ],
  "js_render": true,
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "lvl3": "h4",
    "lvl4": "h5",
    "lvl5": "h6",
    "text": "p, li"
  },
  "scrape_start_urls": true,
  "strip_chars": " .,;:#"
}
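(Since js_render is enabled, one thing I am not sure about is whether the headings and paragraphs these selectors target exist in the raw HTML at all, or only appear after JavaScript rendering. A rough check, assuming curl and grep are available, would be:

curl -s http://doc-dev.wyden.io/docs/book/ | grep -c -E "<h1|<h2|<p"
# counts lines in the un-rendered HTML that contain h1/h2/p tags
)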
The command used to run typesense/docsearch is as follows:
docker run -it --env-file=./Typesense-Scraper-DocDev.env -e "CONFIG=$(cat ./TEST-CONFIG-DOCDEV.json | jq -r tostring)" typesense/docsearch-scraper
(Note that Typesense-Scraper-DocDev.env is the env file and TEST-CONFIG-DOCDEV.json is the config.json file.)
When it runs, the relevant output is as follows:
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG:scrapy.core.engine:Crawled (200) <GET http://doc-dev.wyden.io/docs/book/> (referer: None)
DEBUG:typesense.api_call:Making post /collections/CrucialDocumentation_1689867254/documents/import
> > DocSearch: http://doc-dev.wyden.io/docs/book/ 1 records)
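(For reference, the number of documents that actually land in the collection can be checked via the standard Typesense collections endpoint, using the same API key as in the env file:

curl -H "X-TYPESENSE-API-KEY: xxxxxxxxxxx" https://doc-api.wyden.io/collections
# each collection in the response includes a num_documents field
)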
Does anyone have a clue as to why the Docker container cannot crawl this internal site? It can crawl a different locally hosted site (with different env and config settings, of course), and it can also crawl public sites.
Thanks!