22fa039f1b
Remove debug print
2018-02-26 16:23:14 +01:00
67ad232533
Add a timeout to a single page retrieval
2018-02-26 15:42:36 +01:00
98fe69ba62
Real async crawling
2018-02-26 15:30:38 +01:00
968ff6d24c
More robust crawling
2018-02-26 15:29:36 +01:00
bdfa285e6b
We do not want to use settings
2018-02-26 15:14:53 +01:00
33bdae96e4
merge commit from histories_tobast into histories_models
2018-02-26 12:59:38 +01:00
02e91bb2b7
Fix function calls
2018-02-26 11:56:02 +01:00
45ddbff91a
Crawling and histories: fix a lot of stuff
2018-02-26 11:49:24 +01:00
a4de51b84a
Crawl: do not use global SEARCH_ENGINES
2018-02-26 11:48:51 +01:00
4f0148cb63
Crawler: use a random fingerprint
2018-02-26 11:48:51 +01:00
adb892ab7d
Check if crawling a search engine
2018-02-26 11:12:36 +01:00
15db8b4697
Change option name due to downgrade of aiohttp
2018-02-26 10:23:32 +01:00
15323c3465
[REBASE ME] Crawl: enhance efficiency and output a tree
2018-02-25 15:08:06 +01:00
bc7348f677
Integration of crawl module in histories
2018-02-24 23:17:24 +01:00
d19c2e8216
Add mailto addresses to forbidden list
2018-02-24 15:41:46 +01:00
e56c088632
Better filter
2018-02-24 11:39:04 +01:00
f0b8672c89
Silly me. (bis)
2018-02-23 10:44:51 +01:00
f6da179820
If robots.txt file is invalid, abort mission.
2018-02-23 10:36:14 +01:00
0e02f22d08
Exception handling
Big problem with the URL https:/plus.google.com/+Python concerning
robots.txt parsing.
Didn't find the bug. @tobast, if you have some time to look at it :)
2018-02-23 00:37:36 +01:00
77ca7ebcb9
Silly me.
2018-02-22 15:35:46 +01:00
9b78e268c9
Nearly working crawler
2018-02-22 14:33:07 +01:00
e19e623df1
Multiple bug fixes. TODO: remove <div id=footer>-like patterns
2018-02-22 14:07:53 +01:00
236e15296c
It can be useful to return the links list
2018-02-21 23:11:57 +01:00
4e6ac5ac7b
URL getter function: retrieves the list of so-called relevant links
2018-02-21 22:51:05 +01:00
a907cad33d
Start of URL getter function
2018-02-21 19:06:46 +01:00
b05e642c79
Make the code somewhat readable
2018-02-21 11:54:41 +01:00
c97acb22b5
Add tentative crawl file
Nothing functional, just tests
2018-02-20 12:48:53 +01:00
bef1fca5b9
Init app 'crawl'
2018-02-20 08:51:16 +01:00