Commit graph

19 commits

Author SHA1 Message Date
Théophile Bastian 45ddbff91a Crawling and histories: fix a lot of stuff 2018-02-26 11:49:24 +01:00
Théophile Bastian a4de51b84a Crawl: do not use global SEARCH_ENGINES 2018-02-26 11:48:51 +01:00
Théophile Bastian 4f0148cb63 Crawler: use a random fingerprint 2018-02-26 11:48:51 +01:00
Rémi Oudin adb892ab7d Check if crawling a search engine 2018-02-26 11:12:36 +01:00
Rémi Oudin 15db8b4697 Change option name due to downgrade of aiohttp 2018-02-26 10:23:32 +01:00
Rémi Oudin bc7348f677 Integration of crawl module in histories 2018-02-24 23:17:24 +01:00
Rémi Oudin d19c2e8216 Add mailto addresses to forbidden list 2018-02-24 15:41:46 +01:00
Rémi Oudin e56c088632 Better filter 2018-02-24 11:39:04 +01:00
Rémi Oudin f0b8672c89 Silly me. (bis) 2018-02-23 10:44:51 +01:00
Rémi Oudin f6da179820 If robots.txt file is invalid, abort mission. 2018-02-23 10:36:14 +01:00
Rémi Oudin 0e02f22d08 Exception handling. Big problem with the url https:/plus.google.com/+Python concerning robots parsing. Didn't find the bug. @tobast, if you have some time to look at it :) 2018-02-23 00:37:36 +01:00
Rémi Oudin 77ca7ebcb9 Silly me. 2018-02-22 15:35:46 +01:00
Rémi Oudin 9b78e268c9 Nearly working crawler 2018-02-22 14:33:07 +01:00
Rémi Oudin e19e623df1 Multiple bug fixes. TODO: remove <div id=footer>-like patterns 2018-02-22 14:07:53 +01:00
Rémi Oudin 236e15296c It can be useful to return the links list 2018-02-21 23:11:57 +01:00
Rémi Oudin 4e6ac5ac7b URL getter function: retrieves the list of so-called relevant links 2018-02-21 22:51:05 +01:00
Rémi Oudin a907cad33d Start of url getter function 2018-02-21 19:06:46 +01:00
Théophile Bastian b05e642c79 Make the code somewhat readable 2018-02-21 11:54:41 +01:00
Théophile Bastian c97acb22b5 Add tentative crawl file. Nothing functional, just tests 2018-02-20 12:48:53 +01:00