Update readme.

master
Ethan Dalool 2020-01-27 19:23:50 -08:00
parent ecfb96820c
commit d982d64071
1 changed files with 22 additions and 16 deletions

@@ -16,55 +16,61 @@ I don't have a test suite. You're my test suite! Messages go to [/u/GoldenSights
Timesearch is a collection of utilities for archiving subreddits.
### Make sure you have:
- Installed [Python](https://www.python.org/download). I use Python 3.6.
- Installed [Python](https://www.python.org/download). I use Python 3.7.
- Installed PRAW >= 4, as well as the other modules in `requirements.txt`. Try `pip install -r requirements.txt` to get them all.
- Created an OAuth app at https://reddit.com/prefs/apps. Make it `script` type, and set the redirect URI to `http://localhost:8080`. The title and description can be anything you want, and the about URL is not required.
- Used [this PRAW script](https://praw.readthedocs.io/en/latest/tutorials/refresh_token.html) to generate a refresh token. Just save it as a .py file somewhere and run it through your terminal / command line. For simplicity's sake, I just choose `all` for the scopes.
- Downloaded a copy of [this file](https://github.com/voussoir/reddit/blob/master/bot4.py) and saved it as `bot.py`. Fill out the variables using your OAuth information, and read the instructions to see where to put it. The Useragent is a description of your API usage. Typically "/u/username's praw client" is sufficient.
- Downloaded this project using the green "Clone or Download" button in the upper right.
### This package consists of:
- **timesearch**: If you try to page through `/new` on a subreddit, you'll hit a limit at or before 1,000 posts. Timesearch uses the `timestamp` cloudsearch query parameter to step from the beginning of a subreddit to present time, to collect as many submissions as possible. Read more about timestamp searching [here](https://www.reddit.com/r/reddittips/comments/2ix73n/use_cloudsearch_to_search_for_posts_on_reddit/).
- **get_submissions**: If you try to page through `/new` on a subreddit, you'll hit a limit at or before 1,000 posts. Timesearch uses the pushshift.io dataset to get information about very old posts, and then queries the reddit api to update their information. Previously, we used the `timestamp` cloudsearch query parameter on reddit's own API, but reddit has removed that feature and pushshift is now the only viable source for initial data.
`> timesearch.py timesearch -r subredditname <flags>`
`> timesearch.py timesearch -u username <flags>`
- **commentaugment**: Although we can search for submissions, we cannot search for comments. After performing a timesearch, you can use commentaugment to download the comments on a subreddit, or the comments made by a user.
`> timesearch.py commentaugment -r subredditname <flags>`
`> timesearch.py commentaugment -u username <flags>`
- **get_comments**: Similar to `get_submissions`, this tool queries pushshift for comment data and updates it from reddit.
`> timesearch.py get_comments -r subredditname <flags>`
`> timesearch.py get_comments -u username <flags>`
- **livestream**: timesearch+commentaugment is great for starting your database and getting historical posts, but it's not the best for staying up-to-date. Instead, livestream monitors `/new` and `/comments` to continuously ingest data.
- **livestream**: get_submissions+get_comments is great for starting your database and getting the historical posts, but it's not the best for staying up-to-date. Instead, livestream monitors `/new` and `/comments` to continuously ingest data.
`> timesearch.py livestream -r subredditname <flags>`
`> timesearch.py livestream -u username <flags>`
- **getstyles**: Downloads the stylesheet and CSS images.
`> timesearch.py getstyles -r subredditname`
- **get_styles**: Downloads the stylesheet and CSS images.
`> timesearch.py get_styles -r subredditname`
- **getwiki**: Downloads the wiki pages, sidebar, etc. from /wiki/pages.
`> timesearch.py getwiki -r subredditname`
- **get_wiki**: Downloads the wiki pages, sidebar, etc. from /wiki/pages.
`> timesearch.py get_wiki -r subredditname`
- **offline_reading**: Renders comment threads into HTML via markdown.
Note: I'm currently using the [markdown library from pypi](https://pypi.python.org/pypi/Markdown), and it doesn't do reddit's custom markdown like `/r/` or `/u/`, obviously. So far I don't think anybody really uses o_r so I haven't invested much time into improving it.
`> timesearch.py offline_reading -r subredditname <flags>`
`> timesearch.py offline_reading -u username <flags>`
- **redmash**: Generates plaintext or HTML lists of submissions, sorted by a property of your choosing. You can order by date, author, flair, etc.
`> timesearch.py redmash -r subredditname <flags>`
`> timesearch.py redmash -u username <flags>`
- **index**: Generates plaintext or HTML lists of submissions, sorted by a property of your choosing. You can order by date, author, flair, etc. With the `--offline` parameter, you can make all the links point to the files you generated with `offline_reading`.
`> timesearch.py index -r subredditname <flags>`
`> timesearch.py index -u username <flags>`
- **breakdown**: Produces a JSON file indicating which users make the most posts in a subreddit, or which subreddits a user posts in.
`> timesearch.py breakdown -r subredditname <flags>`
`> timesearch.py breakdown -u username <flags>`
- **mergedb**: Copy all new data from one timesearch database into another. Useful for syncing or merging two scans of the same subreddit.
`> timesearch.py mergedb --from filepath/database1.db --to filepath/database2.db`
- **merge_db**: Copy all new data from one timesearch database into another. Useful for syncing or merging two scans of the same subreddit.
`> timesearch.py merge_db --from filepath/database1.db --to filepath/database2.db`
### To use it
You will need both the `timesearch` package (folder) and the external `timesearch.py` file. You can click the green "Clone or Download" button in the upper right. When you run the .py file, it sends your commandline arguments into the package. You can view a summarized version of all the help text with just `timesearch.py`, or you can view a specific docstring with `timesearch.py livestream`, etc.
When you download this project, the main file that you will execute is `timesearch.py` here in the root directory. It will load the appropriate module to run your command from the modules folder.
You can view a summarized version of all the help text by running `timesearch.py`, and you can view a specific help text by running a command with no arguments, like `timesearch.py livestream`, etc.
I recommend [sqlitebrowser](https://github.com/sqlitebrowser/sqlitebrowser/releases) if you want to inspect the database yourself.
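If you would rather take a quick look from a script than open a GUI, Python's built-in `sqlite3` module can list what is inside a database file. The following is a minimal sketch, not part of timesearch: the function name `list_tables` and the example path are made up for illustration, and it assumes nothing about the table layout beyond the file being a SQLite database.

```python
import pathlib
import sqlite3


def list_tables(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    # Open the file read-only through a URI so that inspecting it can
    # never modify the archive.
    uri = pathlib.Path(db_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        cursor = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        names = [row[0] for row in cursor.fetchall()]
        counts = {}
        for name in names:
            # Quote the identifier in case a table name contains odd characters.
            quoted = '"' + name.replace('"', '""') + '"'
            row = connection.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()
            counts[name] = row[0]
        return counts
    finally:
        connection.close()


# Example (hypothetical path, substitute your own database file):
# print(list_tables("path/to/your_database.db"))
```

Because the connection is opened with `mode=ro`, a mistyped path raises an error instead of silently creating an empty database.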
### Changelog
- 2020 01 27
- When I first created Timesearch, it was simply a collection of all the random scripts I had written to archive various things, and they tended to have wacky names like `commentaugment` and `redmash`. Since the timesearch toolkit is now meant to be a single cohesive package, I decided to finally rename everything. I believe I have aliased everything properly so the old names still work for backwards compatibility. The one exception is that the modules folder is now called `timesearch_modules`, which may break your import statements if you ever imported it on your own.
- 2018 04 09
- Integrated with Pushshift to restore timesearch functionality, speed up commentaugment, and get user comments.
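
The 2020 01 27 entry says the old command names are aliased to the new ones. One way to get that behavior with the standard library is the `aliases` parameter of argparse subparsers. This is only a sketch of the general technique and not necessarily how timesearch implements it: the `RENAMES` mapping, `build_parser`, and the `canonical` attribute are hypothetical names, and only two of the renamed commands are shown.

```python
import argparse

# Hypothetical mapping of new command names to their old names,
# using two of the renames visible in this readme.
RENAMES = {
    "get_styles": ["getstyles"],
    "get_wiki": ["getwiki"],
}


def build_parser():
    parser = argparse.ArgumentParser(prog="timesearch.py")
    subparsers = parser.add_subparsers(dest="command")
    for new_name, old_names in RENAMES.items():
        # aliases= lets the old spelling select the same subparser.
        sub = subparsers.add_parser(new_name, aliases=old_names)
        sub.add_argument("-r", dest="subreddit")
        # Record the canonical name regardless of which spelling was typed.
        sub.set_defaults(canonical=new_name)
    return parser


args = build_parser().parse_args(["getwiki", "-r", "subredditname"])
print(args.canonical, args.subreddit)
```

Here `args.command` holds whatever spelling the user typed, while `args.canonical` always holds the new name, so the rest of the program only needs to know about one name per command.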