Commit graph

63 commits

SHA1 Message Date
a4de51b84a Crawl: do not use global SEARCH_ENGINES 2018-02-26 11:48:51 +01:00
4f0148cb63 Crawler: use a random fingerprint 2018-02-26 11:48:51 +01:00
4a8bd32516 Fix tor_runner import 2018-02-26 11:48:51 +01:00
44cf26df8f It can be useful to save a new object 2018-02-26 11:42:45 +01:00
adb892ab7d Check if crawling a search engine 2018-02-26 11:12:36 +01:00
15db8b4697 Change option name due to downgrade of aiohttp 2018-02-26 10:23:32 +01:00
d6b26c0a46 Better use of history 2018-02-26 10:05:33 +01:00
8f5c4f3f0f Use datetimes 2018-02-26 09:49:24 +01:00
71d9e18eec Add headers support 2018-02-25 23:56:51 +01:00
8ad46c0481 Bug fix, syntax error 2018-02-25 21:59:29 +01:00
f66c978466 Tor runner has a run function to replay the history 2018-02-25 21:53:28 +01:00
0a676a2f65 PEP8 2018-02-25 21:34:20 +01:00
e074d96f02 tor_runner can make requests 2018-02-25 21:27:15 +01:00
ae5699c089 Basic tor runner 2018-02-25 19:42:58 +01:00
f7313ff659 Add populate.sh script 2018-02-25 16:16:04 +01:00
0661fe0f01 Fix path 2018-02-25 16:10:38 +01:00
4b19febdf6 Add interests 2018-02-25 16:10:22 +01:00
05a2e2ca3f Partial generation of profiles 2018-02-25 13:18:12 +01:00
d4aefb6bb7 Load the data 2018-02-25 13:17:44 +01:00
3eb82a4a0b data for names and emails 2018-02-25 13:17:27 +01:00
7c0fb7dda1 Better naming 2018-02-25 11:49:44 +01:00
ee32e5385b Finished data import 2018-02-25 11:49:11 +01:00
bc7348f677 Integration of crawl module in histories 2018-02-24 23:17:24 +01:00
60bfc8cb77 Merge branch 'crawl' into histories_models 2018-02-24 18:44:27 +01:00
12c8c652d7 Serialisation function 2018-02-24 18:40:27 +01:00
c58f42476f Missing script for 854481d 2018-02-24 17:22:52 +01:00
854481dbd3 Import utilities 2018-02-24 17:21:41 +01:00
d19c2e8216 Add mailto addresses to forbidden list 2018-02-24 15:41:46 +01:00
e56c088632 Better filter 2018-02-24 11:39:04 +01:00
f0b8672c89 Silly me. (bis) 2018-02-23 10:44:51 +01:00
f6da179820 If robots.txt file is invalid, abort mission. 2018-02-23 10:36:14 +01:00
0e02f22d08 Exception handling 2018-02-23 00:37:36 +01:00
    Big problem with the url https:/plus.google.com/+Python concerning robots parsing.
    Didn't find the bug. @tobast, if you have some time to look at it :)
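
The guard described by f6da179820 ("If robots.txt file is invalid, abort mission") and the exception handling above might look like the following sketch. This is not the project's actual code; it only assumes Python's standard-library `urllib.robotparser`, and the function name `can_fetch_url` is illustrative. Note that a malformed URL such as `https:/plus.google.com/+Python` (single slash, so no host) makes `urlopen` fail before any request is sent, which is one way robots parsing can blow up:

```python
import urllib.error
import urllib.robotparser
from urllib.parse import urlparse, urlunparse

def can_fetch_url(url, user_agent="*"):
    """Check robots.txt for `url`; if it is invalid or unreadable, abort (deny)."""
    parts = urlparse(url)
    # Build the robots.txt URL for the target site.
    robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
    parser = urllib.robotparser.RobotFileParser(robots_url)
    try:
        # read() swallows 4xx HTTP errors itself, but a malformed URL
        # (e.g. a missing host) raises URLError/ValueError here.
        parser.read()
    except (urllib.error.URLError, ValueError):
        # Invalid or unreachable robots.txt: abort mission.
        return False
    return parser.can_fetch(user_agent, url)

# A URL with a single slash has an empty netloc, so the check fails closed:
print(can_fetch_url("https:/plus.google.com/+Python"))  # → False
```

The "fail closed" choice mirrors the abort-mission commit: when the crawler cannot establish what robots.txt allows, it treats the page as off-limits rather than guessing.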
77ca7ebcb9 Silly me. 2018-02-22 15:35:46 +01:00
9b78e268c9 Nearly working crawler 2018-02-22 14:33:07 +01:00
e19e623df1 Multiple bug fixes. TODO : remove <div id=footer>-like patterns 2018-02-22 14:07:53 +01:00
5decd205fb Typos + improvements 2018-02-22 11:06:45 +01:00
236e15296c It can be useful to return the links list 2018-02-21 23:11:57 +01:00
4e6ac5ac7b Url getter function : retrieves the list of so-called relevant links 2018-02-21 22:51:05 +01:00
a907cad33d Start of url getter function 2018-02-21 19:06:46 +01:00
ad0ad0a783 Command to add browser fingerprint data 2018-02-21 16:50:27 +01:00
b05e642c79 Make the code somewhat readable 2018-02-21 11:54:41 +01:00
cd4d8a4c3f More generic code using @8f4458b 2018-02-21 11:50:28 +01:00
8f4458b009 Url generation method, for more genericity 2018-02-21 11:37:44 +01:00
5539f57139 Add missing docstrings 2018-02-21 11:35:53 +01:00
4920de5838 Going on in the generation of history 2018-02-20 23:42:21 +01:00
c97acb22b5 Add tentative crawl file 2018-02-20 12:48:53 +01:00
    Nothing functional, just tests
c05c2561d2 Add crawler settings and requirements 2018-02-20 12:48:16 +01:00
bef1fca5b9 Init app 'crawl' 2018-02-20 08:51:16 +01:00
7c13ee17d4 Skeleton of history generation 2018-02-19 22:56:16 +01:00
7f343d8ad8 Better formatting 2018-02-19 13:59:29 +01:00