Commit Graph

43 Commits

Author SHA1 Message Date
Rémi Oudin 93b235cb6c Fix interests import 2018-02-25 21:20:52 +01:00
Théophile Bastian 15323c3465 [REBASE ME] Crawl: enhance efficiency and output a tree 2018-02-25 15:08:06 +01:00
Rémi Oudin bc7348f677 Integration of crawl module in histories 2018-02-24 23:17:24 +01:00
Rémi Oudin 60bfc8cb77 Merge branch 'crawl' into histories_models 2018-02-24 18:44:27 +01:00
Rémi Oudin 12c8c652d7 Serialisation function 2018-02-24 18:40:27 +01:00
Rémi Oudin c58f42476f Missing script for 854481d 2018-02-24 17:22:52 +01:00
Rémi Oudin 854481dbd3 Import utilities 2018-02-24 17:21:41 +01:00
Rémi Oudin d19c2e8216 Add mailto addresses to forbidden list 2018-02-24 15:41:46 +01:00
Rémi Oudin e56c088632 Better filter 2018-02-24 11:39:04 +01:00
Rémi Oudin f0b8672c89 Silly me. (bis) 2018-02-23 10:44:51 +01:00
Rémi Oudin f6da179820 If robots.txt file is invalid, abort mission. 2018-02-23 10:36:14 +01:00
Rémi Oudin 0e02f22d08 Exception handling. Big problem with the url https:/plus.google.com/+Python concerning robots parsing. Didn't find the bug. @tobast, if you have some time to look at it :) 2018-02-23 00:37:36 +01:00
Rémi Oudin 77ca7ebcb9 Silly me. 2018-02-22 15:35:46 +01:00
Rémi Oudin 9b78e268c9 Nearly working crawler 2018-02-22 14:33:07 +01:00
Rémi Oudin e19e623df1 Multiple bug fixes. TODO : remove <div id=footer>-like patterns 2018-02-22 14:07:53 +01:00
Rémi Oudin 5decd205fb Typos + improvements 2018-02-22 11:06:45 +01:00
Rémi Oudin 236e15296c It can be useful to return the links list 2018-02-21 23:11:57 +01:00
Rémi Oudin 4e6ac5ac7b Url getter function : retrieves the list of so-called relevant links 2018-02-21 22:51:05 +01:00
Rémi Oudin a907cad33d Start of url getter function 2018-02-21 19:06:46 +01:00
Rémi Oudin ad0ad0a783 Command to add browser fingerprint data 2018-02-21 16:50:27 +01:00
Théophile Bastian b05e642c79 Make the code somewhat readable 2018-02-21 11:54:41 +01:00
Rémi Oudin cd4d8a4c3f More generic code using @8f4458b 2018-02-21 11:50:28 +01:00
Rémi Oudin 8f4458b009 Url generation method, for more genericity 2018-02-21 11:37:44 +01:00
Rémi Oudin 5539f57139 Add missing docstrings 2018-02-21 11:35:53 +01:00
Rémi Oudin 4920de5838 Going on in the generation of history 2018-02-20 23:42:21 +01:00
Théophile Bastian c97acb22b5 Add tentative crawl file. Nothing functional, just tests 2018-02-20 12:48:53 +01:00
Théophile Bastian c05c2561d2 Add crawler settings and requirements 2018-02-20 12:48:16 +01:00
Théophile Bastian bef1fca5b9 Init app 'crawl' 2018-02-20 08:51:16 +01:00
Rémi Oudin 7c13ee17d4 Skeleton of history generation 2018-02-19 22:56:16 +01:00
Rémi Oudin 7f343d8ad8 Better formatting 2018-02-19 13:59:29 +01:00
Rémi Oudin 3b0fa27951 Add histories application to settings file 2018-02-19 13:59:29 +01:00
Rémi Oudin 60f09bd4d3 Add basic models for histories 2018-02-19 13:58:55 +01:00
Théophile Bastian 924657abdb Generate profiles' migration 2018-01-24 22:49:34 +01:00
Théophile Bastian e9b3127226 Use `profiles` as an installed application in pinocchio 2018-01-24 22:49:08 +01:00
Théophile Bastian cbf1911fe7 Add models for Interest and Profile 2018-01-24 22:48:53 +01:00
Théophile Bastian 37581fb96a Add models for Place and Event 2018-01-24 22:39:20 +01:00
Théophile Bastian 6531415d63 Add model for a webpage and website 2018-01-24 14:09:33 +01:00
Théophile Bastian 114c8a3d3e Add model for search engines 2018-01-24 13:52:43 +01:00
Théophile Bastian 225742798b Add BrowserFingerprint model 2018-01-24 13:36:55 +01:00
Théophile Bastian a3e6308837 Init apps `histories` and `profiles` 2018-01-23 18:12:47 +01:00
Théophile Bastian 397784a673 Add first version of requirements.txt. Mainly Django, by now 2018-01-23 18:11:07 +01:00
Théophile Bastian 132b7250c8 Initialize Django 2018-01-23 18:11:00 +01:00
Théophile Bastian c1e3be346f Initial commit 2018-01-23 17:53:08 +01:00