# voussoir

DOCSTRING = '''
OpenDirDL
downloads open directories

The basics:
1. Create a database of the directory's files with
   > opendirdl digest http://website.com/directory/

2. Enable and disable the files you are interested in with
   > opendirdl remove_pattern ".*"
   > opendirdl keep_pattern "Daft%20Punk"
   > opendirdl remove_pattern "folder\.jpg"
   Note the percent-encoded string.

3. Download the enabled files with
   > opendirdl download website.com.db

The specifics:
digest:
    Recursively fetch directories and build a database of file URLs.

    > opendirdl digest http://website.com/directory/ <flags>
    > opendirdl digest !clipboard <flags>

    flags:
    -f | --fullscan:
        When included, perform HEAD requests on all files, to know the size of
        the entire directory.

    -db "x.db" | --database "x.db":
        Use a custom database filename. By default, databases are named after
        the web domain.

download:
    Download the files whose URLs are Enabled in the database.

    > opendirdl download website.com.db <flags>

    flags:
    -o "x" | --outputdir "x":
        Save the files to a custom directory, "x". By default, files are saved
        to a folder named after the web domain.

    -ow | --overwrite:
        When included, download and overwrite files even if they already exist
        in the output directory.

    -bps 100 | --bytespersecond 100:
    -bps 100k | -bps "100 kb" | -bps 100kib | -bps 1.2m
        Ratelimit your download speed. Supports units like "k", "m" according
        to `bytestring.parsebytes`.

keep_pattern:
    Enable URLs which match a regex pattern. Matches are based on the percent-
    encoded strings!

    > opendirdl keep_pattern website.com.db ".*"

remove_pattern:
    Disable URLs which match a regex pattern. Matches are based on the percent-
    encoded strings!

    > opendirdl remove_pattern website.com.db ".*"

list_basenames:
    List Enabled URLs alphabetized by their base filename. This makes it easier
    to find titles of interest in a directory that is very scattered or poorly
    organized.

    > opendirdl list_basenames website.com.db <flags>

    flags:
    -o "x.txt" | --outputfile "x.txt":
        Output the results to a file instead of stdout. This is useful if the
        filenames contain special characters that crash Python, or are so long
        that the console becomes unreadable.

measure:
    Sum up the filesizes of all Enabled URLs.

    > opendirdl measure website.com.db <flags>

    flags:
    -f | --fullscan:
        When included, perform HEAD requests on all files to update their size.

    -n | --new_only:
        When included, perform HEAD requests only on files that haven't gotten
        one yet.

    If a file's size is not known by the time this operation completes, you
    will receive a printed note.

tree:
    Print the file / folder tree.

    > opendirdl tree website.com.db <flags>

    flags:
    -o "x.txt" | --outputfile "x.txt":
        Output the results to a file instead of stdout. This is useful if the
        filenames contain special characters that crash Python, or are so long
        that the console becomes unreadable.

        If the filename ends with ".html", the created page will have
        collapsible boxes rather than a plaintext diagram.
'''

# Module names preceded by `## ~` indicate modules that are imported inside a
# function, because they are not used anywhere else and importing them at
# startup would usually be a waste of time.

import sys

# Please consult my github repo for these files
# https://github.com/voussoir/else
sys.path.append('C:\\git\\else\\Downloady'); import downloady
sys.path.append('C:\\git\\else\\Bytestring'); import bytestring
sys.path.append('C:\\git\\else\\Ratelimiter'); import ratelimiter

import argparse
## ~import bs4
import collections
## ~import hashlib
import os
## ~import re
import requests
import shutil
import sqlite3
## ~tkinter
import traceback
import urllib.parse

FILENAME_BADCHARS = '/\\:*?"<>|'

TERMINAL_WIDTH = shutil.get_terminal_size().columns
DOWNLOAD_CHUNK = 16 * bytestring.KIBIBYTE
UNKNOWN_SIZE_STRING = '???'

# When doing a basic scan, we will not send HEAD requests to URLs that end in
# these strings, because they're probably files.
# This isn't meant to be a comprehensive filetype library, but it covers
# enough of the typical opendir to speed things up.
SKIPPABLE_FILETYPES = [
    '.aac',
    '.avi',
    '.bin',
    '.bmp',
    '.bz2',
    '.epub',
    '.exe',
    '.db',
    '.flac',
    '.gif',
    '.gz',
    '.ico',
    '.iso',
    '.jpeg',
    '.jpg',
    '.m3u',
    '.m4a',
    '.m4v',
    '.mka',
    '.mkv',
    '.mov',
    '.mp3',
    '.mp4',
    '.nfo',
    '.ogg',
    '.ott',
    '.pdf',
    '.png',
    '.rar',
    '.srt',
    '.tar',
    '.ttf',
    '.txt',
    '.wav',
    '.webm',
    '.wma',
    '.zip',
]
SKIPPABLE_FILETYPES = set(x.lower() for x in SKIPPABLE_FILETYPES)

# These filenames are ignored completely. They are case-sensitive.
BLACKLISTED_FILENAMES = [
    'desktop.ini',
    'thumbs.db',
]

# oh shit
HTML_TREE_HEAD = '''
<head>
<meta charset="UTF-8">
<script type="text/javascript">
function collapse(div)
{
    if (div.style.display != "none")
    {
        div.style.display = "none";
    }
    else
    {
        div.style.display = "block";
    }
}
</script>

<style>
*
{
    font-family: Consolas;
}

.directory_even, .directory_odd
{
    padding: 10px;
    padding-left: 15px;
    margin-bottom: 10px;
    border: 1px solid #000;
    box-shadow: 1px 1px 2px 0px rgba(0,0,0,0.3);
}
.directory_even
{
    background-color: #fff;
}
.directory_odd
{
    background-color: #eee;
}
</style>
</head>
'''

HTML_FORMAT_DIRECTORY = '''
<div class="buttonbox">
<button onclick="collapse(this.parentElement.nextElementSibling)">{name} ({size})</button>
{directory_anchor}
</div>
<div class="{css}" style="display:none">
'''.replace('\n', '')

HTML_FORMAT_FILE = '<a href="{url}">{name} ({size})</a><br>'

DB_INIT = '''
CREATE TABLE IF NOT EXISTS urls(
    url TEXT,
    basename TEXT,
    content_length INT,
    content_type TEXT,
    do_download INT
);
CREATE INDEX IF NOT EXISTS urlindex on urls(url);
CREATE INDEX IF NOT EXISTS baseindex on urls(basename);
CREATE INDEX IF NOT EXISTS sizeindex on urls(content_length);
'''.strip()

SQL_URL = 0
SQL_BASENAME = 1
SQL_CONTENT_LENGTH = 2
SQL_CONTENT_TYPE = 3
SQL_DO_DOWNLOAD = 4
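
# For reference, a fetched row lines up with the index constants above like
# this (the values here are illustrative, not taken from a real database):
#   row = ('http://website.com/Music/song.mp3', 'song.mp3', 5242880, 'audio/mpeg', 1)
#   row[SQL_URL] -> the full url, row[SQL_BASENAME] -> 'song.mp3',
#   row[SQL_CONTENT_LENGTH] -> 5242880, row[SQL_DO_DOWNLOAD] -> 1 (Enabled)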
UNMEASURED_WARNING = '''
Note: %d files do not have a stored Content-Length.
Run `measure` with `-f`|`--fullscan` or `-n`|`--new_only` to HEAD request
those files.
'''.strip()

## WALKER ##########################################################################################
## ##
class Walker:
    def __init__(self, walkurl, databasename=None, fullscan=False):
        if not walkurl.endswith('/'):
            walkurl += '/'
        if '://' not in walkurl.split('.')[0]:
            walkurl = 'http://' + walkurl
        self.walkurl = walkurl

        if databasename in (None, ''):
            domain = url_split(self.walkurl)['root']
            databasename = domain + '.db'
            databasename = databasename.replace(':', '#')
        self.databasename = databasename

        write('Opening %s' % self.databasename)
        self.sql = sqlite3.connect(self.databasename)
        self.cur = self.sql.cursor()
        db_init(self.sql, self.cur)

        self.fullscan = bool(fullscan)
        self.queue = collections.deque()
        self.seen_directories = set()

    def smart_insert(self, url=None, head=None, commit=True):
        '''
        See `smart_insert`.
        '''
        smart_insert(self.sql, self.cur, url=url, head=head, commit=commit)

    def extract_hrefs(self, response, tag='a', attribute='href'):
        '''
        Given a Response object, extract href urls.
        External links, index sort links, and blacklisted files are discarded.
        '''
        import bs4
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        elements = soup.find_all(tag)
        for element in elements:
            try:
                href = element[attribute]
            except KeyError:
                continue
            href = urllib.parse.urljoin(response.url, href)

            if not href.startswith(self.walkurl):
                # Don't go to other sites or parent directories.
                continue

            if any(sorter in href for sorter in ('?C=', '?O=', '?M=', '?D=', '?N=', '?S=')):
                # Alternative sort modes for index pages.
                continue

            if any(href.endswith(blacklisted) for blacklisted in BLACKLISTED_FILENAMES):
                continue

            yield href

    def process_url(self, url=None):
        '''
        Given a URL, check whether it is an index page or an actual file.
        If it is an index page, its links are extracted and queued.
        If it is a file, its information is saved to the database.

        We perform a
        HEAD:
            when `self.fullscan` is True.
            when `self.fullscan` is False but the url is not a SKIPPABLE_FILETYPE.
            when the url is an index page.
        GET:
            when the url is an index page.
        '''
        if url is None:
            url = self.walkurl
        else:
            url = urllib.parse.urljoin(self.walkurl, url)

        if url in self.seen_directories:
            # We already picked this up at some point.
            return

        if not url.startswith(self.walkurl):
            # Don't follow external links or parent directory.
            write('Skipping "%s" due to external url.' % url)
            return

        urll = url.lower()
        if self.fullscan is False:
            skippable = any(urll.endswith(ext) for ext in SKIPPABLE_FILETYPES)
            if skippable:
                write('Skipping "%s" due to extension.' % url)
                self.smart_insert(url=url, commit=False)
                return
            self.cur.execute('SELECT * FROM urls WHERE url == ?', [url])
            skippable = self.cur.fetchone() is not None
            if skippable:
                write('Skipping "%s" since we already have it.' % url)
                return

        try:
            head = do_head(url)
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 403:
                write('403 FORBIDDEN!')
                return
            if e.response.status_code == 404:
                write('404 NOT FOUND!')
                return
            raise

        content_type = head.headers.get('Content-Type', '?')
        #print(content_type)
        if content_type.startswith('text/html'):  # and head.url.endswith('/'):
            # This is an index page, so extract links and queue them.
            response = do_get(url)
            hrefs = self.extract_hrefs(response)
            # Just in case the URL we used is different than the real one,
            # such as missing a trailing slash, add both.
            self.seen_directories.add(url)
            self.seen_directories.add(head.url)
            added = 0
            for href in hrefs:
                if href in self.seen_directories:
                    continue
                else:
                    self.queue.append(href)
                    added += 1
            write('Queued %d urls' % added)
        else:
            # This is not an index page, so save it.
            self.smart_insert(head=head, commit=False)

    def walk(self, url=None):
        self.queue.appendleft(url)
        try:
            while len(self.queue) > 0:
                url = self.queue.popleft()
                self.process_url(url)
                line = '{:,} Remaining'.format(len(self.queue))
                write(line)
        except:
            self.sql.commit()
            raise
        self.sql.commit()
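
# A typical programmatic use of Walker is a sketch of what the `digest`
# command does below (the URL here is hypothetical):
#   walker = Walker('http://website.com/directory/', databasename=None, fullscan=False)
#   walker.walk()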
## ##
## WALKER ##########################################################################################
## OTHER CLASSES ###################################################################################
## ##
class Generic:
    def __init__(self, **kwargs):
        for (key, value) in kwargs.items():
            setattr(self, key, value)


class TreeExistingChild(Exception):
    pass

class TreeInvalidIdentifier(Exception):
    pass

class TreeNode:
    def __init__(self, identifier, data=None):
        assert isinstance(identifier, str)
        assert '\\' not in identifier
        self.identifier = identifier
        self.data = data
        self.parent = None
        self.children = {}

    def __eq__(self, other):
        return isinstance(other, TreeNode) and self.abspath() == other.abspath()

    def __getitem__(self, key):
        return self.children[key]

    def __hash__(self):
        return hash(self.abspath())

    def __repr__(self):
        return 'TreeNode %s' % self.abspath()

    def abspath(self):
        node = self
        nodes = [node]
        while node.parent is not None:
            node = node.parent
            nodes.append(node)
        nodes.reverse()
        nodes = [node.identifier for node in nodes]
        return '\\'.join(nodes)

    def add_child(self, other_node, overwrite_parent=False):
        self.check_child_availability(other_node.identifier)
        if other_node.parent is not None and not overwrite_parent:
            raise ValueError('That node already has a parent. Try `overwrite_parent=True`')
        other_node.parent = self
        self.children[other_node.identifier] = other_node
        return other_node

    def check_child_availability(self, identifier):
        if identifier in self.children:
            raise TreeExistingChild('Node %s already has child %s' % (self.identifier, identifier))

    def detach(self):
        del self.parent.children[self.identifier]
        self.parent = None

    def list_children(self, customsort=None):
        children = list(self.children.values())
        if customsort is None:
            children.sort(key=lambda node: node.identifier.lower())
        else:
            children.sort(key=customsort)
        return children

    def merge_other(self, othertree, otherroot=None):
        newroot = None
        if ':' in othertree.identifier:
            if otherroot is None:
                raise Exception('Must specify a new name for the other tree\'s root')
            else:
                newroot = otherroot
        else:
            newroot = othertree.identifier
        othertree.identifier = newroot
        othertree.parent = self
        self.check_child_availability(newroot)
        self.children[newroot] = othertree

    def walk(self, customsort=None):
        yield self
        for child in self.list_children(customsort=customsort):
            yield from child.walk(customsort=customsort)
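
# Minimal sketch of how these nodes are used (the identifiers are made up):
#   root = TreeNode('website.com.db', data={'item_type': 'directory'})
#   child = TreeNode('Music', data={'item_type': 'directory'})
#   root.add_child(child)
#   child.abspath()  ->  'website.com.db\\Music'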
## ##
## OTHER CLASSES ###################################################################################
## GENERAL FUNCTIONS ###############################################################################
## ##
def build_file_tree(databasename):
    sql = sqlite3.connect(databasename)
    cur = sql.cursor()
    cur.execute('SELECT * FROM urls WHERE do_download == 1')
    all_items = cur.fetchall()
    sql.close()

    if len(all_items) == 0:
        return

    path_form = '{root}\\{folder}\\{filename}'
    all_items = [
        {
            'url': item[SQL_URL],
            'size': item[SQL_CONTENT_LENGTH],
            'path_parts': path_form.format(**url_split(item[SQL_URL])).split('\\'),
        }
        for item in all_items
    ]
    all_items.sort(key=lambda x: x['url'])

    root_data = {
        'item_type': 'directory',
        'name': databasename,
    }
    scheme = url_split(all_items[0]['url'])['scheme']
    tree = TreeNode(databasename, data=root_data)
    tree.unsorted_children = all_items
    node_queue = set()
    node_queue.add(tree)

    # In this process, URLs are divided up into their nodes one directory layer
    # at a time. The root receives all URLs, and creates nodes for each of the
    # top-level directories. Those nodes receive all subdirectories, and repeat.
    while len(node_queue) > 0:
        node = node_queue.pop()
        for new_child_data in node.unsorted_children:
            path_parts = new_child_data['path_parts']
            # Create a new node for the directory, path_parts[0].
            # path_parts[1:] is assigned to that node to be divided next.
            child_identifier = path_parts.pop(0)
            #child_identifier = child_identifier.replace(':', '#')

            child = node.children.get(child_identifier, None)
            if not child:
                child = TreeNode(child_identifier, data={})
                child.unsorted_children = []
                node.add_child(child)

            child.data['name'] = child_identifier
            if len(path_parts) > 0:
                child.data['item_type'] = 'directory'
                child.unsorted_children.append(new_child_data)
                node_queue.add(child)
            else:
                child.data['item_type'] = 'file'
                child.data['size'] = new_child_data['size']
                child.data['url'] = new_child_data['url']

        if node.parent is None:
            continue
        elif node.parent == tree:
            node.data['url'] = scheme + '://' + node.identifier
        else:
            node.data['url'] = node.parent.data['url'] + '/' + node.identifier

        del node.unsorted_children
    return tree
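
# To illustrate with a made-up URL: a row for
# http://website.com/Music/Daft%20Punk/face.jpg produces path_parts of
# ['website.com', 'Music', 'Daft Punk', 'face.jpg'] (url_split unquotes the
# percent-encoding, and this sketch assumes Windows where os.path.sep is '\\'),
# so the file becomes a leaf nested under website.com -> Music -> Daft Punk.
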
def db_init(sql, cur):
    lines = DB_INIT.split(';')
    for line in lines:
        cur.execute(line)
    sql.commit()
    return True

def do_get(url, raise_for_status=True):
    return do_request('GET', requests.get, url, raise_for_status=raise_for_status)

def do_head(url, raise_for_status=True):
    return do_request('HEAD', requests.head, url, raise_for_status=raise_for_status)

def do_request(message, method, url, raise_for_status=True):
    message = '{message:>4s}: {url} : '.format(message=message, url=url)
    write(message, end='', flush=True)
    response = method(url)
    write(response.status_code)
    if raise_for_status:
        response.raise_for_status()
    return response
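
# Each request produces one console line; for a hypothetical URL it would look
# roughly like:
#   HEAD: http://website.com/directory/ : 200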
def fetch_generator(cur):
    while True:
        fetch = cur.fetchone()
        if fetch is None:
            break
        yield fetch

def filepath_sanitize(text, allowed=''):
    badchars = set(char for char in FILENAME_BADCHARS if char not in allowed)
    text = ''.join(char for char in text if char not in badchars)
    return text

def get_clipboard():
    import tkinter
    t = tkinter.Tk()
    clip = t.clipboard_get()
    t.destroy()
    return clip

def hashit(text, length=None):
    import hashlib
    h = hashlib.sha512(text.encode('utf-8')).hexdigest()
    if length is not None:
        h = h[:length]
    return h

def listget(l, index, default=None):
    try:
        return l[index]
    except IndexError:
        return default

def longest_length(li):
    longest = 0
    for item in li:
        longest = max(longest, len(item))
    return longest

def recursive_get_size(node):
    '''
    Calculate the size of the Directory nodes by summing the sizes of all children.
    Modifies the nodes in-place.
    '''
    return_value = {
        'size': 0,
        'unmeasured': 0,
    }
    if node.data['item_type'] == 'file':
        if node.data['size'] is None:
            return_value['unmeasured'] = 1
        return_value['size'] = node.data['size']
    else:
        for child in node.list_children():
            child_details = recursive_get_size(child)
            return_value['size'] += child_details['size'] or 0
            return_value['unmeasured'] += child_details['unmeasured']

    node.data['size'] = return_value['size']
    return return_value

def recursive_print_node(node, depth=0, use_html=False, output_file=None):
    '''
    Given a tree node (presumably the root), print it and all of its children.
    '''
    size = node.data['size']
    if size is None:
        size = UNKNOWN_SIZE_STRING
    else:
        size = bytestring.bytestring(size)

    if use_html:
        css_class = 'directory_even' if depth % 2 == 0 else 'directory_odd'
        if node.data['item_type'] == 'directory':
            directory_url = node.data.get('url')
            directory_anchor = '<a href="{url}">►</a>' if directory_url else ''
            directory_anchor = directory_anchor.format(url=directory_url)
            line = HTML_FORMAT_DIRECTORY.format(
                css=css_class,
                directory_anchor=directory_anchor,
                name=node.data['name'],
                size=size,
            )
        else:
            line = HTML_FORMAT_FILE.format(
                name=node.data['name'],
                size=size,
                url=node.data['url'],
            )
    else:
        line = '{space}{bar}{name} : ({size})'
        line = line.format(
            space='| ' * (depth-1),
            bar='|---' if depth > 0 else '',
            name=node.data['name'],
            size=size,
        )
    write(line, output_file)

    # Sort by type (directories first) then subsort by lowercase path
    customsort = lambda node: (
        node.data['item_type'] == 'file',
        node.data['url'].lower(),
    )
    for child in node.list_children(customsort=customsort):
        recursive_print_node(child, depth=depth+1, use_html=use_html, output_file=output_file)

    if node.data['item_type'] == 'directory':
        if use_html:
            # Close the directory div
            write('</div>', output_file)
        else:
            # This helps put some space between sibling directories
            write('| ' * (depth), output_file)
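
# In plaintext mode the output looks roughly like this (made-up names and sizes):
#   website.com.db : (1.5 GiB)
#   |---Music : (1.5 GiB)
#   | |---song.mp3 : (5.2 MiB)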
def safeprint(text, **kwargs):
    text = str(text)
    text = text.encode('ascii', 'replace').decode()
    #text = text.replace('?', '_')
    print(text, **kwargs)

def smart_insert(sql, cur, url=None, head=None, commit=True):
    '''
    INSERT or UPDATE the appropriate entry, or DELETE if the head
    shows a 403 / 404.
    '''
    if bool(url) is bool(head) and not isinstance(head, requests.Response):
        raise ValueError('One and only one of `url` or `head` is necessary.')

    if url is not None:
        # When doing a basic scan, all we get is the URL.
        content_length = None
        content_type = None

    elif head is not None:
        # When doing a full scan, we get a Response object.
        url = head.url
        if head.status_code in [403, 404]:
            cur.execute('DELETE FROM urls WHERE url == ?', [url])
            if commit:
                sql.commit()
            return (url, None, 0, None, 0)
        else:
            content_length = head.headers.get('Content-Length', None)
            if content_length is not None:
                content_length = int(content_length)
            content_type = head.headers.get('Content-Type', None)

    basename = url_split(url)['filename']
    basename = urllib.parse.unquote(basename)
    do_download = True

    cur.execute('SELECT * FROM urls WHERE url == ?', [url])
    existing_entry = cur.fetchone()
    is_new = existing_entry is None

    data = (url, basename, content_length, content_type, do_download)
    if is_new:
        cur.execute('INSERT INTO urls VALUES(?, ?, ?, ?, ?)', data)
    else:
        command = '''
            UPDATE urls SET
            content_length = coalesce(?, content_length),
            content_type = coalesce(?, content_type)
            WHERE url == ?
        '''
        cur.execute(command, [content_length, content_type, url])

    if commit:
        sql.commit()
    return data
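
# smart_insert is called in two shapes (the URLs here are illustrative):
#   smart_insert(sql, cur, url='http://website.com/directory/file.mp3')   # basic scan
#   smart_insert(sql, cur, head=do_head('http://website.com/directory/file.mp3'))   # full scan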
def url_split(text):
    text = urllib.parse.unquote(text)
    parts = urllib.parse.urlsplit(text)
    if any(part == '' for part in [parts.scheme, parts.netloc]):
        raise ValueError('Not a valid URL')

    scheme = parts.scheme
    root = parts.netloc
    (folder, filename) = os.path.split(parts.path)
    while folder.startswith('/'):
        folder = folder[1:]

    # Folders are allowed to have slashes...
    folder = filepath_sanitize(folder, allowed='/\\')
    folder = folder.replace('\\', os.path.sep)
    folder = folder.replace('/', os.path.sep)
    # ...but Files are not.
    filename = filepath_sanitize(filename)

    result = {
        'scheme': scheme,
        'root': root,
        'folder': folder,
        'filename': filename,
    }
    return result
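
# Illustrative example (assuming Windows, where os.path.sep is '\\'):
#   url_split('http://website.com/Music/Daft%20Punk/face.jpg')
#   -> {'scheme': 'http', 'root': 'website.com',
#       'folder': 'Music\\Daft Punk', 'filename': 'face.jpg'}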
def write(line, file_handle=None, **kwargs):
    if file_handle is None:
        safeprint(line, **kwargs)
    else:
        file_handle.write(line + '\n', **kwargs)

## ##
## GENERAL FUNCTIONS ###############################################################################
## COMMANDLINE FUNCTIONS ###########################################################################
## ##
def digest(walkurl, databasename=None, fullscan=False):
    if walkurl in ('!clipboard', '!c'):
        walkurl = get_clipboard()
        write('From clipboard: %s' % walkurl)
    walker = Walker(
        databasename=databasename,
        fullscan=fullscan,
        walkurl=walkurl,
    )
    walker.walk()

def digest_argparse(args):
    return digest(
        databasename=args.databasename,
        fullscan=args.fullscan,
        walkurl=args.walkurl,
    )

def download(
    databasename,
    outputdir=None,
    bytespersecond=None,
    headers=None,
    overwrite=False,
):
    '''
    Download all of the Enabled files. The filepaths will match that of the
    website, using `outputdir` as the root directory.

    Parameters:
        outputdir:
            The directory to mirror the files into. If not provided, the domain
            name is used.

        bytespersecond:
            The speed to ratelimit the downloads. Can be an integer, or a string
            like '500k', according to the capabilities of `bytestring.parsebytes`.
            Note that this is bytes, not bits.

        headers:
            Additional headers to pass to each `download_file` call.

        overwrite:
            If True, delete local copies of existing files and rewrite them.
            Otherwise, completed files are skipped.
    '''
    sql = sqlite3.connect(databasename)
    cur = sql.cursor()

    if outputdir in (None, ''):
        # This assumes that all URLs in the database are from the same domain.
        # If they aren't, it's the user's fault, because Walkers don't leave the
        # given site on their own.
        cur.execute('SELECT url FROM urls LIMIT 1')
        url = cur.fetchone()[0]
        outputdir = url_split(url)['root']

    if isinstance(bytespersecond, str):
        bytespersecond = bytestring.parsebytes(bytespersecond)

    cur.execute('SELECT * FROM urls WHERE do_download == 1 ORDER BY url')
    for fetch in fetch_generator(cur):
        url = fetch[SQL_URL]

        url_filepath = url_split(url)
        folder = os.path.join(outputdir, url_filepath['folder'])
        os.makedirs(folder, exist_ok=True)
        fullname = os.path.join(folder, url_filepath['filename'])

        write('Downloading "%s"' % fullname)
        downloady.download_file(
            url,
            localname=fullname,
            bytespersecond=bytespersecond,
            callback_progress=downloady.progress2,
            overwrite=overwrite,
        )
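
# Sketch of a direct call, equivalent to the `download` command with a -bps flag:
#   download('website.com.db', bytespersecond='500k', overwrite=False)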
2016-07-05 07:24:08 +00:00
def download_argparse(args):
    return download(
        databasename=args.databasename,
        outputdir=args.outputdir,
        overwrite=args.overwrite,
        bytespersecond=args.bytespersecond,
    )

def filter_pattern(databasename, regex, action='keep', *trash):
    '''
    When `action` is 'keep', then any URLs matching the regex will have their
    `do_download` flag set to True.

    When `action` is 'remove', then any URLs matching the regex will have their
    `do_download` flag set to False.

    Actions will not act on each other's behalf. Keep will NEVER disable a url,
    and remove will NEVER enable one.
    '''
    import re
    if isinstance(regex, str):
        regex = [regex]

    keep = action == 'keep'
    remove = action == 'remove'

    sql = sqlite3.connect(databasename)
    cur = sql.cursor()
    cur.execute('SELECT * FROM urls')
    items = cur.fetchall()
    for item in items:
        url = item[SQL_URL]
        current_do_dl = item[SQL_DO_DOWNLOAD]
        for pattern in regex:
            contains = re.search(pattern, url) is not None
            if keep and contains and not current_do_dl:
                write('Enabling "%s"' % url)
                cur.execute('UPDATE urls SET do_download = 1 WHERE url == ?', [url])
            if remove and contains and current_do_dl:
                write('Disabling "%s"' % url)
                cur.execute('UPDATE urls SET do_download = 0 WHERE url == ?', [url])
    sql.commit()
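
# Sketch: enable every flac, then disable anything under a "live" folder
# (hypothetical patterns; matches run against the stored, percent-encoded URLs):
#   filter_pattern('website.com.db', regex=r'\.flac$', action='keep')
#   filter_pattern('website.com.db', regex='/live/', action='remove')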
def keep_pattern_argparse(args):
    '''
    See `filter_pattern`.
    '''
    return filter_pattern(
        action='keep',
        databasename=args.databasename,
        regex=args.regex,
    )

def list_basenames(databasename, output_filename=None):
    '''
    Print the Enabled entries in order of the file basenames.
    This makes it easier to find interesting titles without worrying about
    what directory they're in.
    '''
    sql = sqlite3.connect(databasename)
    cur = sql.cursor()
    cur.execute('SELECT * FROM urls WHERE do_download == 1')
    items = cur.fetchall()

    longest = max(items, key=lambda x: len(x[SQL_BASENAME]))
    longest = len(longest[SQL_BASENAME])

    items.sort(key=lambda x: x[SQL_BASENAME].lower())

    if output_filename is not None:
        output_file = open(output_filename, 'w', encoding='utf-8')
    else:
        output_file = None

    form = '{basename:<%ds} : {url} : {size}' % longest
    for item in items:
        size = item[SQL_CONTENT_LENGTH]
        if size is None:
            size = ''
        else:
            size = bytestring.bytestring(size)
        line = form.format(
            basename=item[SQL_BASENAME],
            url=item[SQL_URL],
            size=size,
        )
        write(line, output_file)

    if output_file:
        output_file.close()
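
# Each output line follows `form` above; a hypothetical entry would print as:
#   face.jpg        : http://website.com/Music/Daft%20Punk/face.jpg : 40.5 KiB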
2016-07-05 07:24:08 +00:00
def list_basenames_argparse(args):
    return list_basenames(
        databasename=args.databasename,
        output_filename=args.outputfile,
    )

def measure(databasename, fullscan=False, new_only=False):
    '''
    Given a database, print the sum of all Content-Lengths.
    URLs will be HEAD requested if:
        `new_only` is True and the file has no stored content length, or
        `fullscan` is True and `new_only` is False
    '''
    if isinstance(fullscan, str):
        fullscan = bool(fullscan)

    totalsize = 0
    sql = sqlite3.connect(databasename)
    cur = sql.cursor()

    if new_only:
        cur.execute('SELECT * FROM urls WHERE do_download == 1 AND content_length IS NULL')
    else:
        cur.execute('SELECT * FROM urls WHERE do_download == 1')

    items = cur.fetchall()
    filecount = len(items)
    unmeasured_file_count = 0

    for fetch in items:
        size = fetch[SQL_CONTENT_LENGTH]

        if fullscan or new_only:
            url = fetch[SQL_URL]
            head = do_head(url, raise_for_status=False)
            fetch = smart_insert(sql, cur, head=head, commit=True)
            size = fetch[SQL_CONTENT_LENGTH]

        elif size is None:
            # Unmeasured and no intention to measure.
            unmeasured_file_count += 1
            size = 0

        if size is None:
            # Unmeasured even though we tried the head request.
            write('"%s" is not revealing Content-Length' % url)
            size = 0

        totalsize += size

    sql.commit()

    size_string = bytestring.bytestring(totalsize)
    totalsize_string = '{size_short} ({size_exact:,} bytes) in {filecount:,} files'
    totalsize_string = totalsize_string.format(
        size_short=size_string,
        size_exact=totalsize,
        filecount=filecount,
    )
    write(totalsize_string)

    if unmeasured_file_count > 0:
        write(UNMEASURED_WARNING % unmeasured_file_count)

    return totalsize
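
# The summary line looks roughly like this (made-up numbers; the short form
# depends on how bytestring.bytestring renders the total):
#   1.5 GiB (1,610,612,736 bytes) in 327 files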
def measure_argparse(args):
    return measure(
        databasename=args.databasename,
        fullscan=args.fullscan,
        new_only=args.new_only,
    )

def remove_pattern_argparse(args):
    '''
    See `filter_pattern`.
    '''
    return filter_pattern(
        action='remove',
        databasename=args.databasename,
        regex=args.regex,
    )

def tree(databasename, output_filename=None):
    '''
    Print a tree diagram of the directory-file structure.

    If an .html file is given for `output_filename`, the page will have
    collapsible boxes and clickable filenames. Otherwise the file will just
    be a plain text drawing.
    '''
    tree = build_file_tree(databasename)

    if output_filename is not None:
        output_file = open(output_filename, 'w', encoding='utf-8')
        use_html = output_filename.lower().endswith('.html')
    else:
        output_file = None
        use_html = False

    if use_html:
        write('<!DOCTYPE html>\n<html>', output_file)
        write(HTML_TREE_HEAD, output_file)
        write('<body>', output_file)

    size_details = recursive_get_size(tree)
    recursive_print_node(tree, use_html=use_html, output_file=output_file)
    if size_details['unmeasured'] > 0:
        write(UNMEASURED_WARNING % size_details['unmeasured'], output_file)

    if output_file is not None:
        if use_html:
            write('</body>\n</html>', output_file)
        output_file.close()
    return tree

def tree_argparse(args):
    return tree(
        databasename=args.databasename,
        output_filename=args.outputfile,
    )

## ##
## COMMANDLINE FUNCTIONS ###########################################################################
def main(argv):
    if listget(argv, 1, '').lower() in ('help', '-h', '--help', ''):
        write(DOCSTRING)
        return

    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers()

    p_digest = subparsers.add_parser('digest')
    p_digest.add_argument('walkurl')
    p_digest.add_argument('-db', '--database', dest='databasename', default=None)
    p_digest.add_argument('-f', '--fullscan', dest='fullscan', action='store_true')
    p_digest.set_defaults(func=digest_argparse)

    p_download = subparsers.add_parser('download')
    p_download.add_argument('databasename')
    p_download.add_argument('-o', '--outputdir', dest='outputdir', default=None)
    p_download.add_argument('-bps', '--bytespersecond', dest='bytespersecond', default=None)
    p_download.add_argument('-ow', '--overwrite', dest='overwrite', action='store_true')
    p_download.set_defaults(func=download_argparse)

    p_keep_pattern = subparsers.add_parser('keep_pattern')
    p_keep_pattern.add_argument('databasename')
    p_keep_pattern.add_argument('regex')
    p_keep_pattern.set_defaults(func=keep_pattern_argparse)

    p_list_basenames = subparsers.add_parser('list_basenames')
    p_list_basenames.add_argument('databasename')
    p_list_basenames.add_argument('-o', '--outputfile', dest='outputfile', default=None)
    p_list_basenames.set_defaults(func=list_basenames_argparse)

    p_measure = subparsers.add_parser('measure')
    p_measure.add_argument('databasename')
    p_measure.add_argument('-f', '--fullscan', dest='fullscan', action='store_true')
    p_measure.add_argument('-n', '--new_only', dest='new_only', action='store_true')
    p_measure.set_defaults(func=measure_argparse)

    p_remove_pattern = subparsers.add_parser('remove_pattern')
    p_remove_pattern.add_argument('databasename')
    p_remove_pattern.add_argument('regex')
    p_remove_pattern.set_defaults(func=remove_pattern_argparse)

    p_tree = subparsers.add_parser('tree')
    p_tree.add_argument('databasename')
    p_tree.add_argument('-o', '--outputfile', dest='outputfile', default=None)
    p_tree.set_defaults(func=tree_argparse)

    args = parser.parse_args(argv)
    args.func(args)

if __name__ == '__main__':
    main(sys.argv[1:])