mpri-webdam

Author	SHA1	Message	Date
Théophile Bastian	22064ebee3	Histories: xml import/export, untested. To be tested when history generation is available	2018-02-26 11:48:51 +01:00
Théophile Bastian	a4de51b84a	Crawl: do not use global SEARCH_ENGINES	2018-02-26 11:48:51 +01:00
Théophile Bastian	4f0148cb63	Crawler: use a random fingerprint	2018-02-26 11:48:51 +01:00
Théophile Bastian	4a8bd32516	Fix tor_runner import	2018-02-26 11:48:51 +01:00
Rémi Oudin	44cf26df8f	It can be useful to save a new object	2018-02-26 11:42:45 +01:00
Rémi Oudin	adb892ab7d	Check if crawling a search engine	2018-02-26 11:12:36 +01:00
Rémi Oudin	15db8b4697	Change option name due to downgrade of aiohttp	2018-02-26 10:23:32 +01:00
Rémi Oudin	d6b26c0a46	Better use of history	2018-02-26 10:05:33 +01:00
Rémi Oudin	8f5c4f3f0f	Use datetimes	2018-02-26 09:49:24 +01:00
Rémi Oudin	71d9e18eec	Add headers support	2018-02-25 23:56:51 +01:00
Rémi Oudin	8ad46c0481	Bug fix, syntax error	2018-02-25 21:59:29 +01:00
Rémi Oudin	f66c978466	Tor runner has a run function to replay the history	2018-02-25 21:53:28 +01:00
Rémi Oudin	0a676a2f65	PEP8	2018-02-25 21:34:20 +01:00
Rémi Oudin	e074d96f02	tor_runner can make requests	2018-02-25 21:27:15 +01:00
Rémi Oudin	ae5699c089	Basic tor runner	2018-02-25 19:42:58 +01:00
Rémi Oudin	f7313ff659	Add populate.sh script	2018-02-25 16:16:04 +01:00
Rémi Oudin	0661fe0f01	Fix path	2018-02-25 16:10:38 +01:00
Rémi Oudin	4b19febdf6	Add interests	2018-02-25 16:10:22 +01:00
Rémi Oudin	05a2e2ca3f	Partial generation of profiles	2018-02-25 13:18:12 +01:00
Rémi Oudin	d4aefb6bb7	Load the data	2018-02-25 13:17:44 +01:00
Rémi Oudin	3eb82a4a0b	data for names and emails	2018-02-25 13:17:27 +01:00
Rémi Oudin	7c0fb7dda1	Better naming	2018-02-25 11:49:44 +01:00
Rémi Oudin	ee32e5385b	Finished data import	2018-02-25 11:49:11 +01:00
Rémi Oudin	bc7348f677	Integration of crawl module in histories	2018-02-24 23:17:24 +01:00
Rémi Oudin	60bfc8cb77	Merge branch 'crawl' into histories_models	2018-02-24 18:44:27 +01:00
Rémi Oudin	12c8c652d7	Serialisation function	2018-02-24 18:40:27 +01:00
Rémi Oudin	c58f42476f	Missing script for `854481d`	2018-02-24 17:22:52 +01:00
Rémi Oudin	854481dbd3	Import utilities	2018-02-24 17:21:41 +01:00
Rémi Oudin	d19c2e8216	Add mailto addresses to forbidden list	2018-02-24 15:41:46 +01:00
Rémi Oudin	e56c088632	Better filter	2018-02-24 11:39:04 +01:00
Rémi Oudin	f0b8672c89	Silly me. (bis)	2018-02-23 10:44:51 +01:00
Rémi Oudin	f6da179820	If robots.txt file is invalid, abort mission.	2018-02-23 10:36:14 +01:00
Rémi Oudin	0e02f22d08	Exception handling. Big problem with the url https:/plus.google.com/+Python concerning robots parsing. Didn't find the bug. @tobast, if you have some time to look at it :)	2018-02-23 00:37:36 +01:00
Rémi Oudin	77ca7ebcb9	Silly me.	2018-02-22 15:35:46 +01:00
Rémi Oudin	9b78e268c9	Nearly working crawler	2018-02-22 14:33:07 +01:00
Rémi Oudin	e19e623df1	Multiple bug fixes. TODO : remove <div id=footer>-like patterns	2018-02-22 14:07:53 +01:00
Rémi Oudin	5decd205fb	Typos + improvements	2018-02-22 11:06:45 +01:00
Rémi Oudin	236e15296c	It can be useful to return the links list	2018-02-21 23:11:57 +01:00
Rémi Oudin	4e6ac5ac7b	Url getter function : retrieves the list of so-called relevant links	2018-02-21 22:51:05 +01:00
Rémi Oudin	a907cad33d	Start of url getter function	2018-02-21 19:06:46 +01:00
Rémi Oudin	ad0ad0a783	Command to add browser fingerprint data	2018-02-21 16:50:27 +01:00
Théophile Bastian	b05e642c79	Make the code somewhat readable	2018-02-21 11:54:41 +01:00
Rémi Oudin	cd4d8a4c3f	More generic code using @8f4458b	2018-02-21 11:50:28 +01:00
Rémi Oudin	8f4458b009	Url generation method, for more genericity	2018-02-21 11:37:44 +01:00
Rémi Oudin	5539f57139	Add missing docstrings	2018-02-21 11:35:53 +01:00
Rémi Oudin	4920de5838	Going on in the generation of history	2018-02-20 23:42:21 +01:00
Théophile Bastian	c97acb22b5	Add tentative crawl file. Nothing functional, just tests	2018-02-20 12:48:53 +01:00
Théophile Bastian	c05c2561d2	Add crawler settings and requirements	2018-02-20 12:48:16 +01:00
Théophile Bastian	bef1fca5b9	Init app 'crawl'	2018-02-20 08:51:16 +01:00
Rémi Oudin	7c13ee17d4	Skeleton of history generation	2018-02-19 22:56:16 +01:00

64 commits