mpri-webdam

Author	SHA1	Message	Date
Théophile Bastian	b7be4f4df4	Crawling and histories: fix a lot of stuff	2018-02-26 11:47:31 +01:00
Théophile Bastian	fd4e1d35c7	Crawl: do not use global SEARCH_ENGINES	2018-02-26 11:45:08 +01:00
Théophile Bastian	8f1d69bc41	Crawler: use a random fingerprint	2018-02-26 11:39:55 +01:00
Rémi Oudin	adb892ab7d	Check if crawling a search engine	2018-02-26 11:12:36 +01:00
Rémi Oudin	15db8b4697	Change option name due to downgrade of aiohttp	2018-02-26 10:23:32 +01:00
Rémi Oudin	bc7348f677	Integration of crawl module in histories	2018-02-24 23:17:24 +01:00
Rémi Oudin	d19c2e8216	Add mailto addresses to forbidden list	2018-02-24 15:41:46 +01:00
Rémi Oudin	e56c088632	Better filter	2018-02-24 11:39:04 +01:00
Rémi Oudin	f0b8672c89	Silly me. (bis)	2018-02-23 10:44:51 +01:00
Rémi Oudin	f6da179820	If robots.txt file is invalid, abort mission.	2018-02-23 10:36:14 +01:00
Rémi Oudin	0e02f22d08	Exception handling. Big problem with the url https:/plus.google.com/+Python concerning robots parsing. Didn't find the bug. @tobast, if you have some time to look at it :)	2018-02-23 00:37:36 +01:00
Rémi Oudin	77ca7ebcb9	Silly me.	2018-02-22 15:35:46 +01:00
Rémi Oudin	9b78e268c9	Nearly working crawler	2018-02-22 14:33:07 +01:00
Rémi Oudin	e19e623df1	Multiple bug fixes. TODO : remove <div id=footer>-like patterns	2018-02-22 14:07:53 +01:00
Rémi Oudin	236e15296c	It can be useful to return the links list	2018-02-21 23:11:57 +01:00
Rémi Oudin	4e6ac5ac7b	Url getter function : retrieves the list of so-called relevant links	2018-02-21 22:51:05 +01:00
Rémi Oudin	a907cad33d	Start of url getter function	2018-02-21 19:06:46 +01:00
Théophile Bastian	b05e642c79	Make the code somewhat readable	2018-02-21 11:54:41 +01:00
Théophile Bastian	c97acb22b5	Add tentative crawl file. Nothing functional, just tests	2018-02-20 12:48:53 +01:00
Théophile Bastian	bef1fca5b9	Init app 'crawl'	2018-02-20 08:51:16 +01:00

20 commits