timesearch
==========
I don't have a test suite. You're my test suite! Messages go to [/u/GoldenSights](https://reddit.com/u/GoldenSights).
Timesearch is a collection of utilities for archiving subreddits.
### Make sure you have:
- Installed [Python](https://www.python.org/download). I use Python 3.6.
- Installed PRAW >= 4, as well as the other modules in `requirements.txt`. Try `pip install -r requirements.txt` to get them all.
- Created an OAuth app at https://reddit.com/prefs/apps. Make it `script` type, and set the redirect URI to `http://localhost:8080`. The title and description can be anything you want, and the about URL is not required.
- Used [this PRAW script](https://praw.readthedocs.io/en/latest/tutorials/refresh_token.html) to generate a refresh token. Just save it as a .py file somewhere and run it through your terminal / command line. For simplicity's sake, I just choose `all` for the scopes.
- Downloaded a copy of [this file](https://github.com/voussoir/reddit/blob/master/bot4.py) and saved it as `bot.py`. Fill out the variables using your OAuth information, and read the instructions to see where to put it. The Useragent is a description of your API usage. Typically "/u/username's praw client" is sufficient.
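
  To sanity-check your credentials, here is a minimal sketch of the PRAW pieces that `bot.py` ultimately wires together; the real `bot.py` has its own variable names and layout, so fill that one out rather than copying this verbatim:

  ```python
  # Hypothetical credential check -- not the contents of the real bot.py.
  import praw

  r = praw.Reddit(
      client_id='yourappid',             # from https://reddit.com/prefs/apps
      client_secret='yourappsecret',     # from the same page
      refresh_token='yourrefreshtoken',  # from the PRAW refresh-token script
      user_agent="/u/yourusername's praw client",
  )
  print(r.user.me())  # prints your username if the credentials work
  ```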
### This package consists of:
- **timesearch**: If you try to page through `/new` on a subreddit, you'll hit a limit at or before 1,000 posts. Timesearch uses the `timestamp` cloudsearch query parameter to step from the beginning of a subreddit to present time, to collect as many submissions as possible. Read more about timestamp searching [here](https://www.reddit.com/r/reddittips/comments/2ix73n/use_cloudsearch_to_search_for_posts_on_reddit/).
`> timesearch.py timesearch -r subredditname <flags>`
`> timesearch.py timesearch -u username <flags>`
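
  The underlying trick, sketched here with plain PRAW rather than timesearch's own code, is to search one timestamp window at a time instead of paging `/new`; PRAW 4+ wraps the same loop in `subreddit.submissions()`, which is what timesearch now uses (see the 2017 11 04 changelog entry).

  ```python
  # Illustration of timestamp-windowed searching; assumed subreddit name,
  # fixed one-day window, and placeholder credentials.
  import time
  import praw

  r = praw.Reddit(client_id='yourappid', client_secret='yourappsecret',
                  refresh_token='yourrefreshtoken', user_agent='timesearch example')

  WINDOW = 86400          # one day per query; shrink this for busy subreddits,
                          # since a single search listing tops out around 1,000 items
  lower = 1420070400      # start from (roughly) the subreddit's creation date
  while lower < time.time():
      query = 'timestamp:{}..{}'.format(int(lower), int(lower + WINDOW))
      for submission in r.subreddit('learnpython').search(query, syntax='cloudsearch'):
          print(int(submission.created_utc), submission.fullname, submission.title)
      lower += WINDOW
  ```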
- **commentaugment**: Although we can search for submissions, we cannot search for comments. After performing a timesearch, you can use commentaugment to download the comment tree for each submission.
Note: commentaugment only gets the comments attached to the submissions that you found in your timesearch scan. If you're trying to commentaugment on a user, you're going to get comments that were made on their submissions, **not** comments they made on other people's submissions. Therefore, comprehensively collecting a user's activity is not possible. You will have to use someone else's dataset like that of [/u/Stuck_in_the_Matrix](https://reddit.com/u/Stuck_in_the_Matrix) at [pushshift.io](https://pushshift.io).
`> timesearch.py commentaugment -r subredditname <flags>`
`> timesearch.py commentaugment -u username <flags>`
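
  Per thread, this boils down to PRAW's comment-tree expansion; a minimal sketch of that step (not timesearch's actual code):

  ```python
  # Fetch the full comment tree for one submission -- illustration only.
  import praw

  r = praw.Reddit(client_id='yourappid', client_secret='yourappsecret',
                  refresh_token='yourrefreshtoken', user_agent='timesearch example')

  submission = r.submission(id='abcdef')         # hypothetical thread id
  submission.comments.replace_more(limit=None)   # keep expanding "load more comments"
  for comment in submission.comments.list():     # flattened tree
      print(comment.id, comment.author, comment.body[:80])
  ```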
- **livestream**: timesearch+commentaugment is great for starting your database and getting historical posts, but it's not the best for staying up-to-date. Instead, livestream monitors `/new` and `/comments` to continuously ingest data.
`> timesearch.py livestream -r subredditname <flags>`
`> timesearch.py livestream -u username <flags>`
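
  PRAW's stream helpers illustrate the same watch-`/new`-and-`/comments` idea; livestream itself adds the database writes, error handling, and the `-u` mode, so treat this only as a sketch:

  ```python
  # Watch a subreddit's /new feed as it updates -- illustration only.
  import praw

  r = praw.Reddit(client_id='yourappid', client_secret='yourappsecret',
                  refresh_token='yourrefreshtoken', user_agent='timesearch example')

  # subreddit.stream.comments() works the same way for /comments;
  # each stream blocks forever, so run them in separate processes.
  for submission in r.subreddit('learnpython').stream.submissions():
      print(submission.id, submission.author, submission.title)
  ```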
- **getstyles**: Downloads the stylesheet and CSS images.
`> timesearch.py getstyles -r subredditname`
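
  Roughly the PRAW call it wraps (a sketch, not the tool's exact code):

  ```python
  # Grab the stylesheet CSS and its image list -- illustration only.
  import praw

  r = praw.Reddit(client_id='yourappid', client_secret='yourappsecret',
                  refresh_token='yourrefreshtoken', user_agent='timesearch example')

  style = r.subreddit('learnpython').stylesheet()
  print(style.stylesheet[:200])      # the raw CSS
  for image in style.images:         # each entry includes a download URL
      print(image['name'], image['url'])
  ```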
- **getwiki**: Downloads the wiki pages, sidebar, etc. from /wiki/pages.
`> timesearch.py getwiki -r subredditname`
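
  Roughly the PRAW loop behind it (a sketch, not the tool's exact code):

  ```python
  # Iterate a subreddit's wiki pages -- illustration only.
  import praw

  r = praw.Reddit(client_id='yourappid', client_secret='yourappsecret',
                  refresh_token='yourrefreshtoken', user_agent='timesearch example')

  for page in r.subreddit('learnpython').wiki:
      print(page.name)
      print(page.content_md[:200])   # the page's markdown source
  ```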
- **offline_reading**: Renders comment threads into HTML via markdown.
Note: I'm currently using the [markdown library from pypi](https://pypi.python.org/pypi/Markdown), and it doesn't do reddit's custom markdown like `/r/` or `/u/`, obviously. So far I don't think anybody really uses o_r so I haven't invested much time into improving it.
`> timesearch.py offline_reading -r subredditname <flags>`
`> timesearch.py offline_reading -u username <flags>`
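
  The rendering step itself is just that library; a minimal sketch for a single comment body (offline_reading's real job is assembling the nested comment tree and page layout around this):

  ```python
  # Convert one comment's markdown to HTML with the pypi markdown library.
  import markdown

  body = "Some *comment* text with a [link](https://example.com)."
  print(markdown.markdown(body))
  # -> <p>Some <em>comment</em> text with a <a href="https://example.com">link</a>.</p>
  ```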
- **redmash**: Generates plaintext or HTML lists of submissions, sorted by a property of your choosing. You can order by date, author, flair, etc.
`> timesearch.py redmash -r subredditname <flags>`
`> timesearch.py redmash -u username <flags>`
- **breakdown**: Produces a JSON file indicating which users make the most posts in a subreddit, or which subreddits a user posts in.
`> timesearch.py breakdown -r subredditname <flags>`
`> timesearch.py breakdown -u username <flags>`
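
  If you want the same numbers by hand, here is a hedged SQLite sketch (the `submissions` table name comes from the database scheme described in the changelog, but the `author` column and filename are assumptions; check your own database in sqlitebrowser first):

  ```python
  # Count posts per author in a timesearch subreddit database -- illustration only.
  import sqlite3

  con = sqlite3.connect('subredditname.db')   # assumed filename
  rows = con.execute(
      'SELECT author, COUNT(*) AS posts FROM submissions '
      'GROUP BY author ORDER BY posts DESC LIMIT 20'
  ).fetchall()
  for author, posts in rows:
      print(posts, author)
  con.close()
  ```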
- **mergedb**: Copies all new data from one timesearch database into another. Useful for syncing or merging two scans of the same subreddit.
`> timesearch.py mergedb --from filepath/database1.db --to filepath/database2.db`
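
  Conceptually this is SQLite's `ATTACH` plus `INSERT OR IGNORE`; a hand-rolled sketch of the idea (table names are assumptions, and the real command above handles the schema for you, so prefer it):

  ```python
  # Copy rows database2 doesn't have yet from database1 -- concept sketch only.
  import sqlite3

  con = sqlite3.connect('filepath/database2.db')
  con.execute("ATTACH DATABASE 'filepath/database1.db' AS other")
  for table in ('submissions', 'comments'):   # assumed table names
      con.execute('INSERT OR IGNORE INTO {0} SELECT * FROM other.{0}'.format(table))
  con.commit()
  con.close()
  ```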
### To use it
You will need both the `timesearch` package (folder) and the external `timesearch.py` file. You can click the green "Clone or Download" button in the upper right. When you run the .py file, it sends your commandline arguments into the package. You can view a summarized version of all the help text with just `timesearch.py`, or you can view a specific docstring with `timesearch.py livestream`, etc.
I recommend [sqlitebrowser](https://github.com/sqlitebrowser/sqlitebrowser/releases) if you want to inspect the database yourself.
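
If you'd rather poke at a database from Python, the standard library's sqlite3 module is enough; for example, to list the tables it contains:

```python
# List the tables inside a timesearch database (path is a placeholder).
import sqlite3

con = sqlite3.connect('path/to/your_database.db')
for (name,) in con.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)
con.close()
```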
### Changelog
- 2017 11 13
- Gave timesearch its own Github repository so that (1) it will be easier for people to download it and (2) it has a cleaner, more independent URL. [voussoir/timesearch](https://github.com/voussoir/timesearch)
- 2017 11 05
- Added a try-except inside livestream helper to prevent generator from terminating.
- 2017 11 04
- For timesearch, I switched from using my custom cloudsearch iterator to the one that comes with PRAW4+.
- 2017 10 12
- Added the `mergedb` utility for combining databases.
- 2017 06 02
- You can use `commentaugment -s abcdef` to get a particular thread even if you haven't scraped anything else from that subreddit. Previously `-s` only worked if the database already existed and you specified it via `-r`. Now it is inferred from the submission itself.
- 2017 04 28
- Complete restructure into package, started using PRAW4.
- 2016 08 10
- Started merging redmash and wrote its argparser
- 2016 07 03
- Improved docstring clarity.
- 2016 07 02
- Added `livestream` argparse
- 2016 06 07
- Offline_reading has been merged with the main timesearch file
- `get_all_posts` renamed to `timesearch`
- Timesearch parameter `usermode` renamed to `username`; `maxupper` renamed to `upper`.
- Everything now accessible via commandline arguments. Read the docstring at the top of the file.
- 2016 06 05
- NEW DATABASE SCHEME. Submissions and comments now live in different tables like they should have all along. Submission table has two new columns for a little bit of commentaugment metadata. This allows commentaugment to only scan threads that are new.
- You can use the `migrate_20160605.py` script to convert old databases into new ones.
- 2015 11 11
- created `offline_reading.py` which converts a timesearch database into a comment tree that can be rendered into HTML
- 2015 09 07
- fixed bug which allowed `livestream` to crash because `bot.refresh()` was outside of the try-catch.
- 2015 08 19
- fixed bug in which updatescores stopped iterating early if you had more than 100 comments in a row in the db
- commentaugment has been completely merged into the timesearch.py file. You can use commentaugment_prompt() to input the parameters, or use the commentaugment() function directly.
____
I want to live in a future where everyone uses UTC and agrees on daylight savings.
<p align="center">
<img src="https://github.com/voussoir/reddit/blob/master/.GitImages/timesearch_logo_256.png?raw=true" alt="Timesearch"/>
</p>