e074d96f02
tor_runner can make requests
2018-02-25 21:27:15 +01:00
93b235cb6c
Fix interests import
2018-02-25 21:20:52 +01:00
ae5699c089
Basic tor runner
2018-02-25 19:42:58 +01:00
f7313ff659
Add populate.sh script
2018-02-25 16:16:04 +01:00
0661fe0f01
Fix path
2018-02-25 16:10:38 +01:00
4b19febdf6
Add interests
2018-02-25 16:10:22 +01:00
15323c3465
[REBASE ME] Crawl: enhance efficiency and output a tree
2018-02-25 15:08:06 +01:00
c3bcdea1eb
Add tentative export to RDF
2018-02-25 14:37:30 +01:00
05a2e2ca3f
Partial generation of profiles
2018-02-25 13:18:12 +01:00
d4aefb6bb7
Load the data
2018-02-25 13:17:44 +01:00
3eb82a4a0b
data for names and emails
2018-02-25 13:17:27 +01:00
7c0fb7dda1
Better naming
2018-02-25 11:49:44 +01:00
ee32e5385b
Finished data import
2018-02-25 11:49:11 +01:00
bc7348f677
Integration of crawl module in histories
2018-02-24 23:17:24 +01:00
60bfc8cb77
Merge branch 'crawl' into histories_models
2018-02-24 18:44:27 +01:00
12c8c652d7
Serialisation function
2018-02-24 18:40:27 +01:00
c58f42476f
Missing script for 854481d
2018-02-24 17:22:52 +01:00
854481dbd3
Import utilities
2018-02-24 17:21:41 +01:00
d19c2e8216
Add mailto addresses to forbidden list
2018-02-24 15:41:46 +01:00
e56c088632
Better filter
2018-02-24 11:39:04 +01:00
2732e4115f
Add RDF models export classes — untested
Also add a dependency to https://github.com/tobast/RDFSerializer/
2018-02-23 13:32:32 +01:00
f0b8672c89
Silly me. (bis)
2018-02-23 10:44:51 +01:00
f6da179820
If robots.txt file is invalid, abort mission.
2018-02-23 10:36:14 +01:00
0e02f22d08
Exception handling
Big problem with the url https:/plus.google.com/+Python concerning
robots parsing.
Didn't find the bug. @tobast, if you have some time to look at it :)
2018-02-23 00:37:36 +01:00
77ca7ebcb9
Silly me.
2018-02-22 15:35:46 +01:00
9b78e268c9
Nearly working crawler
2018-02-22 14:33:07 +01:00
e19e623df1
Multiple bug fixes. TODO: remove <div id=footer>-like patterns
2018-02-22 14:07:53 +01:00
5decd205fb
Typos + improvements
2018-02-22 11:06:45 +01:00
236e15296c
It can be useful to return the links list
2018-02-21 23:11:57 +01:00
4e6ac5ac7b
URL getter function: retrieves the list of so-called relevant links
2018-02-21 22:51:05 +01:00
a907cad33d
Start of url getter function
2018-02-21 19:06:46 +01:00
ad0ad0a783
Command to add browser fingerprint data
2018-02-21 16:50:27 +01:00
b05e642c79
Make the code somewhat readable
2018-02-21 11:54:41 +01:00
cd4d8a4c3f
More generic code using @8f4458b
2018-02-21 11:50:28 +01:00
8f4458b009
Url generation method, for more genericity
2018-02-21 11:37:44 +01:00
5539f57139
Add missing docstrings
2018-02-21 11:35:53 +01:00
4920de5838
Going on in the generation of history
2018-02-20 23:42:21 +01:00
c97acb22b5
Add tentative crawl file
Nothing functional, just tests
2018-02-20 12:48:53 +01:00
c05c2561d2
Add crawler settings and requirements
2018-02-20 12:48:16 +01:00
bef1fca5b9
Init app 'crawl'
2018-02-20 08:51:16 +01:00
7c13ee17d4
Skeleton of history generation
2018-02-19 22:56:16 +01:00
7f343d8ad8
Better formatting
2018-02-19 13:59:29 +01:00
3b0fa27951
Add histories application to settings file
2018-02-19 13:59:29 +01:00
60f09bd4d3
Add basic models for histories
2018-02-19 13:58:55 +01:00
924657abdb
Generate profiles' migration
2018-01-24 22:49:34 +01:00
e9b3127226
Use profiles as an installed application in pinocchio
2018-01-24 22:49:08 +01:00
cbf1911fe7
Add models for Interest and Profile
2018-01-24 22:48:53 +01:00
37581fb96a
Add models for Place and Event
2018-01-24 22:39:20 +01:00
6531415d63
Add model for a webpage and website
2018-01-24 14:09:33 +01:00
114c8a3d3e
Add model for search engines
2018-01-24 13:52:43 +01:00