else/OpenDirDL/opendirdl.py

# voussoir
'''
OpenDirDL
downloads open directories

The basics:
1. Create a database of the directory's files with
    > opendirdl digest http://website.com/directory/
2. Enable and disable the files you are interested in with
    > opendirdl website.com.db remove_pattern ".*"
    > opendirdl website.com.db keep_pattern "Daft%20Punk"
    > opendirdl website.com.db remove_pattern "folder\.jpg"
   Note the percent-encoded string.
3. Download the enabled files with
    > opendirdl website.com.db download


The specifics:

digest:
    Recursively fetch directories and build a database of file URLs.

    > opendirdl digest http://website.com/directory/ <flags>
    > opendirdl digest !clipboard <flags>

    flags:
    -f | --fullscan:
        When included, perform HEAD requests on all files, to know the size of
        the entire directory.

    -db "x.db" | --databasename "x.db":
        Use a custom database filename. By default, databases are named after
        the web domain.

download:
    Download the files whose URLs are Enabled in the database.

    > opendirdl download website.com.db <flags>

    flags:
    -o "x" | --outputdir "x":
        Save the files to a custom directory, "x". By default, files are saved
        to a folder named after the web domain.

    -ow | --overwrite:
        When included, download and overwrite files even if they already exist
        in the output directory.

    -bps 100 | --bytespersecond 100:
    -bps 100k | -bps "100 kb" | -bps 100kib | -bps 1.2m
        Ratelimit your download speed. Supports units like "k", "m" according
        to `bytestring.parsebytes`.

keep_pattern:
    Enable URLs which match a regex pattern. Matches are based on the percent-
    encoded strings!

    > opendirdl keep_pattern website.com.db ".*"

remove_pattern:
    Disable URLs which match a regex pattern. Matches are based on the percent-
    encoded strings!

    > opendirdl remove_pattern website.com.db ".*"

list_basenames:
    List Enabled URLs alphabetized by their base filename. This makes it easier
    to find titles of interest in a directory that is very scattered or poorly
    organized.

    > opendirdl list_basenames website.com.db <flags>

    flags:
    -o "x.txt" | --outputfile "x.txt":
        Output the results to a file instead of stdout. This is useful if the
        filenames contain special characters that crash Python, or are so long
        that the console becomes unreadable.

list_urls:
    List Enabled URLs in alphabetical order. No stylization, just dumped.

    > opendirdl list_urls website.com.db <flags>

    flags:
    -o "x.txt" | --outputfile "x.txt":
        Output the results to a file instead of stdout.

measure:
    Sum up the filesizes of all Enabled URLs.

    > opendirdl measure website.com.db <flags>

    flags:
    -f | --fullscan:
        When included, perform HEAD requests on all files to update their size.

    -n | --new_only:
        When included, perform HEAD requests only on files that haven't gotten
        one yet.

    -t 4 | --threads 4:
        The number of threads to use for performing requests.

    If a file's size is not known by the time this operation completes, you
    will receive a printed note.

tree:
    Print the file / folder tree.

    > opendirdl tree website.com.db <flags>

    flags:
    -o "x.txt" | --outputfile "x.txt":
        Output the results to a file instead of stdout. This is useful if the
        filenames contain special characters that crash Python, or are so long
        that the console becomes unreadable.

        If the filename ends with ".html", the created page will have
        collapsible boxes rather than a plaintext diagram.
'''

# Module names preceded by `## ` indicate modules that are imported during
# a function, because they are not used anywhere else and we don't need to waste
# time importing them usually, but I still want them listed here for clarity.
import argparse
## import bs4
import collections
import concurrent.futures
## import hashlib
import os
## import re
import requests
import shutil
import sqlite3
import sys
import threading
import time
## import tkinter
import urllib.parse

# pip install voussoirkit
from voussoirkit import bytestring
from voussoirkit import downloady
from voussoirkit import fusker
from voussoirkit import treeclass
from voussoirkit import pathtree
sys.path.append('C:\\git\\else\\threadqueue'); import threadqueue

DOWNLOAD_CHUNK = 16 * bytestring.KIBIBYTE
FILENAME_BADCHARS = '/\\:*?"<>|'
TERMINAL_WIDTH = shutil.get_terminal_size().columns

# When doing a basic scan, we will not send HEAD requests to URLs that end in
# these strings, because they're probably files.
# This isn't meant to be a comprehensive filetype library, but it covers
# enough of the typical opendir to speed things up.
SKIPPABLE_FILETYPES = [
    '.3gp',
    '.7z',
    '.aac',
    '.avi',
    '.bin',
    '.bmp',
    '.bz2',
    '.epub',
    '.exe',
    '.db',
    '.flac',
    '.gif',
    '.gz',
    '.ico',
    '.iso',
    '.jpeg',
    '.jpg',
    '.m3u',
    '.m4a',
    '.m4v',
    '.mka',
    '.mkv',
    '.mov',
    '.mp3',
    '.mp4',
    '.nfo',
    '.ogg',
    '.ott',
    '.pdf',
    '.png',
    '.rar',
    '.sfv',
    '.srt',
    '.tar',
    '.ttf',
    '.txt',
    '.wav',
    '.webm',
    '.wma',
    '.xml',
    '.zip',
]
SKIPPABLE_FILETYPES = set(x.lower() for x in SKIPPABLE_FILETYPES)
SKIPPABLE_FILETYPES.update(fusker.fusker('.r[0-99]'))
SKIPPABLE_FILETYPES.update(fusker.fusker('.r[00-99]'))
SKIPPABLE_FILETYPES.update(fusker.fusker('.r[000-099]'))
SKIPPABLE_FILETYPES.update(fusker.fusker('.[00-99]'))

# URLs ending in these filenames are ignored completely.
# The comparison is case-sensitive.
BLACKLISTED_FILENAMES = [
    'desktop.ini',
    'thumbs.db',
]

DB_INIT = '''
CREATE TABLE IF NOT EXISTS urls(
    url TEXT,
    basename TEXT,
    content_length INT,
    content_type TEXT,
    do_download INT
);
CREATE INDEX IF NOT EXISTS urlindex on urls(url);
CREATE INDEX IF NOT EXISTS baseindex on urls(basename);
CREATE INDEX IF NOT EXISTS sizeindex on urls(content_length);
'''.strip()
# Column positions within a row of the `urls` table.
SQL_URL = 0
SQL_BASENAME = 1
SQL_CONTENT_LENGTH = 2
SQL_CONTENT_TYPE = 3
SQL_DO_DOWNLOAD = 4

UNMEASURED_WARNING = '''
Note: %d files do not have a stored Content-Length.
Run `measure` with `-f`|`--fullscan` or `-n`|`--new_only` to HEAD request
those files.
'''.strip()

## WALKER ##########################################################################################
##                                                                                                ##
class Walker:
    '''
    This class manages the extraction and saving of URLs, given a starting root url.
    '''
    def __init__(self, root_url, databasename=None, fullscan=False, threads=1):
        if not root_url.endswith('/'):
            root_url += '/'
        if '://' not in root_url.split('.')[0]:
            root_url = 'http://' + root_url
        self.root_url = root_url

        if databasename in (None, ''):
            domain = url_split(self.root_url)['domain']
            databasename = domain + '.db'
            databasename = databasename.replace(':', '#')
        self.databasename = databasename

        write('Opening %s' % self.databasename)
        self.sql = sqlite3.connect(self.databasename)
        self.cur = self.sql.cursor()
        db_init(self.sql, self.cur)

        self.thread_queue = threadqueue.ThreadQueue(threads)
        self._main_thread = threading.current_thread().ident
        self.fullscan = bool(fullscan)
        self.queue = collections.deque()
        self.seen_directories = set()

    def smart_insert(self, url=None, head=None, commit=True):
        '''
        See `smart_insert`.
        '''
        smart_insert(self.sql, self.cur, url=url, head=head, commit=commit)

    def extract_hrefs(self, response, tag='a', attribute='href'):
        '''
        Given a Response object, extract href urls.
        External links, index sort links, and blacklisted files are discarded.
        '''
        import bs4
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        elements = soup.find_all(tag)
        for element in elements:
            try:
                href = element[attribute]
            except KeyError:
                continue
            href = urllib.parse.urljoin(response.url, href)

            if not href.startswith(self.root_url):
                # Don't go to other sites or parent directories.
                continue

            if any(sorter in href for sorter in ('?C=', '?O=', '?M=', '?D=', '?N=', '?S=')):
                # Alternative sort modes for index pages.
                continue

            if any(href.endswith(blacklisted) for blacklisted in BLACKLISTED_FILENAMES):
                continue

            yield href

    def process_url(self, url=None):
        '''
        Given a URL, check whether it is an index page or an actual file.
        If it is an index page, its links are extracted and queued.
        If it is a file, its information is saved to the database.

        We perform a
        HEAD:
            when `self.fullscan` is True.
            when `self.fullscan` is False but the url is not a SKIPPABLE_FILETYPE.
            when the url is an index page.
        GET:
            when the url is an index page.
        '''
        if url is None:
            url = self.root_url
        else:
            url = urllib.parse.urljoin(self.root_url, url)

        if url in self.seen_directories:
            # We already picked this up at some point
            return

        if not url.startswith(self.root_url):
            # Don't follow external links or parent directory.
            write('Skipping "%s" due to external url.' % url)
            return

        urll = url.lower()
        if self.fullscan is False:
            skippable = any(urll.endswith(ext) for ext in SKIPPABLE_FILETYPES)
            if skippable:
                write('Skipping "%s" due to extension.' % url)
                #self.smart_insert(url=url, commit=False)
                #return {'url': url, 'commit': False}
                self.thread_queue.behalf(self._main_thread, self.smart_insert, url=url, commit=False)
                return
            skippable = lambda: self.cur.execute('SELECT * FROM urls WHERE url == ?', [url]).fetchone()
            skippable = self.thread_queue.behalf(self._main_thread, skippable)
            #print(skippable)
            skippable = skippable is not None
            #skippable = self.cur.fetchone() is not None
            if skippable:
                write('Skipping "%s" since we already have it.' % url)
                return

        try:
            head = do_head(url)
        except requests.exceptions.HTTPError as exception:
            if exception.response.status_code == 403:
                write('403 FORBIDDEN!')
                return
            if exception.response.status_code == 404:
                write('404 NOT FOUND!')
                return
            raise
        content_type = head.headers.get('Content-Type', '?')
        #print(content_type)
        if content_type.startswith('text/html'):# and head.url.endswith('/'):
            # This is an index page, so extract links and queue them.
            response = do_get(url)
            hrefs = self.extract_hrefs(response)
            # Just in case the URL we used is different than the real one,
            # such as missing a trailing slash, add both.
            self.seen_directories.add(url)
            self.seen_directories.add(head.url)
            added = 0
            for href in hrefs:
                if href in self.seen_directories:
                    continue
                else:
                    #self.queue.append(href)
                    self.thread_queue.add(self.process_url, href)
                    added += 1
            write('Queued %d urls' % added)
        else:
            # This is not an index page, so save it.
            #self.smart_insert(head=head, commit=False)
            self.thread_queue.behalf(self._main_thread, self.smart_insert, head=head, commit=False)
            #return {'head': head, 'commit': False}

    def walk(self, url=None):
        '''
        Given a starting URL (defaults to self.root_url), continually extract
        links from the page and repeat.
        '''
        #self.queue.appendleft(url)
        self.thread_queue.add(self.process_url, url)
        for return_value in self.thread_queue.run(hold_open=False):
            pass
        #try:
        #    while len(self.queue) > 0:
        #        url = self.queue.popleft()
        #        self.process_url(url)
        #        line = '{:,} Remaining'.format(len(self.queue))
        #        write(line)
        #except:
        #    self.sql.commit()
        #    raise
        self.sql.commit()
##                                                                                                ##
## WALKER ##########################################################################################

## GENERAL FUNCTIONS ###############################################################################
##                                                                                                ##
def build_file_tree(databasename):
    sql = sqlite3.connect(databasename)
    cur = sql.cursor()
    cur.execute('SELECT * FROM urls WHERE do_download == 1')
    fetch_all = cur.fetchall()
    sql.close()

    if len(fetch_all) == 0:
        return

    path_datas = []
    # :|| is my temporary (probably not temporary) hack for including the URL
    # scheme without causing the pathtree processor to think there's a top
    # level directory called 'http'.
    # It will be replaced with :// in the calling `tree` function.
    path_form = '{scheme}:||{domain}\\{folder}\\{filename}'
    for item in fetch_all:
        url = item[SQL_URL]
        size = item[SQL_CONTENT_LENGTH]
        path_parts = url_split(item[SQL_URL])
        path = path_form.format(**path_parts)
        path = urllib.parse.unquote(path)
        path_data = {'path': path, 'size': size, 'data': url}
        path_datas.append(path_data)

    return pathtree.from_paths(path_datas, root_name=databasename)

def db_init(sql, cur):
    lines = DB_INIT.split(';')
    for line in lines:
        cur.execute(line)
    sql.commit()
    return True
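
# Illustrative sketch, not part of the original module. `db_init` above splits
# DB_INIT on ';' and executes each statement in turn. Assuming the schema never
# contains a ';' inside a string literal, the standard library's
# `sqlite3.Connection.executescript` can run the whole script in one call.
# `_EXAMPLE_SCHEMA` and `_example_db_init` are hypothetical names used only here.
import sqlite3

_EXAMPLE_SCHEMA = '''
CREATE TABLE IF NOT EXISTS urls(
    url TEXT,
    basename TEXT,
    content_length INT,
    content_type TEXT,
    do_download INT
);
CREATE INDEX IF NOT EXISTS urlindex on urls(url);
'''

def _example_db_init(sql):
    # IF NOT EXISTS makes the script safe to run against an existing database.
    sql.executescript(_EXAMPLE_SCHEMA)
    sql.commit()
    return True
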

def do_get(url, raise_for_status=True):
    return do_request('GET', requests.get, url, raise_for_status=raise_for_status)

def do_head(url, raise_for_status=True):
    return do_request('HEAD', requests.head, url, raise_for_status=raise_for_status)

def do_request(message, method, url, raise_for_status=True):
    form = '{message:>4s}: {url} : {status}'
    write(form.format(message=message, url=url, status=''))
    response = method(url)
    write(form.format(message=message, url=url, status=response.status_code))
    if raise_for_status:
        response.raise_for_status()
    return response

def fetch_generator(cur):
    while True:
        fetch = cur.fetchone()
        if fetch is None:
            break
        yield fetch

def filepath_sanitize(text, allowed=''):
    '''
    Remove forbidden characters from the text, unless specifically sanctioned.
    '''
    # Characters listed in `allowed` are exempt from removal.
    badchars = set(char for char in FILENAME_BADCHARS if char not in allowed)
    text = ''.join(char for char in text if char not in badchars)
    return text
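
# Illustrative sketch, not part of the original module. The same filtering as
# `filepath_sanitize` can be expressed with `str.translate`, which deletes every
# character that the table maps to None. `_EXAMPLE_BADCHARS` repeats the value
# of FILENAME_BADCHARS so the sketch stands alone; both names are hypothetical.
_EXAMPLE_BADCHARS = '/\\:*?"<>|'

def _example_sanitize(text, allowed=''):
    # Build a deletion table for every bad character that was not sanctioned.
    table = {ord(char): None for char in _EXAMPLE_BADCHARS if char not in allowed}
    return text.translate(table)
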

def get_clipboard():
    import tkinter
    t = tkinter.Tk()
    clip = t.clipboard_get()
    t.destroy()
    return clip

def hashit(text, length=None):
    import hashlib
    sha = hashlib.sha512(text.encode('utf-8')).hexdigest()
    if length is not None:
        sha = sha[:length]
    return sha

def int_none(x):
    if x is None:
        return x
    return int(x)

def promise_results(promises):
    # Yield each promise's result as soon as it is done, in completion order.
    promises = promises[:]
    while len(promises) > 0:
        for (index, promise) in enumerate(promises):
            if not promise.done():
                continue
            yield promise.result()
            promises.pop(index)
            break
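
# Illustrative sketch, not part of the original module. `promise_results` polls
# its list until a promise is done. Assuming the promises are
# `concurrent.futures.Future` objects, the standard library's `as_completed`
# yields them in completion order without a polling loop.
# `_example_results_as_completed` is a hypothetical name used only here.
import concurrent.futures

def _example_results_as_completed(function, arguments, threads=4):
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        promises = [executor.submit(function, argument) for argument in arguments]
        # Results arrive in the order the work finishes, not submission order.
        for promise in concurrent.futures.as_completed(promises):
            yield promise.result()
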

def safeindex(sequence, index, fallback=None):
    try:
        return sequence[index]
    except IndexError:
        return fallback

def safeprint(*texts, **kwargs):
    texts = [str(text).encode('ascii', 'replace').decode() for text in texts]
    print(*texts, **kwargs)

def smart_insert(sql, cur, url=None, head=None, commit=True):
    '''
    INSERT or UPDATE the appropriate entry, or DELETE if the head
    shows a 403 / 404.
    '''
    if bool(url) is bool(head) and not isinstance(head, requests.Response):
        raise ValueError('One and only one of `url` or `head` is necessary.')

    if url is not None:
        # When doing a basic scan, all we get is the URL.
        content_length = None
        content_type = None

    elif head is not None:
        url = head.url
        # When doing a full scan, we get a Response object.
        if head.status_code in [403, 404]:
            cur.execute('DELETE FROM urls WHERE url == ?', [url])
            if commit:
                sql.commit()
            return (url, None, 0, None, 0)
        else:
            url = head.url
            content_length = head.headers.get('Content-Length', None)
            if content_length is not None:
                content_length = int(content_length)
            content_type = head.headers.get('Content-Type', None)

    basename = url_split(url)['filename']
    #basename = urllib.parse.unquote(basename)
    do_download = True

    cur.execute('SELECT * FROM urls WHERE url == ?', [url])
    existing_entry = cur.fetchone()
    is_new = existing_entry is None

    data = (url, basename, content_length, content_type, do_download)
    if is_new:
        cur.execute('INSERT INTO urls VALUES(?, ?, ?, ?, ?)', data)
    else:
        command = '''
            UPDATE urls SET
            content_length = coalesce(?, content_length),
            content_type = coalesce(?, content_type)
            WHERE url == ?
        '''
        cur.execute(command, [content_length, content_type, url])

    if commit:
        sql.commit()
    return data
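
# Illustrative sketch, not part of the original module. The UPDATE in
# `smart_insert` wraps each new value in coalesce(), so passing None leaves the
# stored value untouched instead of erasing it. This demonstrates the effect on
# an in-memory table; the table layout, URL, and `_example_coalesce_update` are
# hypothetical and trimmed down for the demonstration.
import sqlite3

def _example_coalesce_update():
    sql = sqlite3.connect(':memory:')
    cur = sql.cursor()
    cur.execute('CREATE TABLE urls(url TEXT, content_length INT, content_type TEXT)')
    url = 'http://example.com/a.mp3'
    cur.execute('INSERT INTO urls VALUES(?, ?, ?)', [url, 1234, None])
    command = '''
        UPDATE urls SET
        content_length = coalesce(?, content_length),
        content_type = coalesce(?, content_type)
        WHERE url == ?
    '''
    # The size is unknown this time (None), but the type has been learned.
    cur.execute(command, [None, 'audio/mpeg', url])
    row = cur.execute('SELECT * FROM urls').fetchone()
    sql.close()
    return row
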

def url_split(url):
    '''
    Given a url, return a dictionary of its components.
    '''
    #url = urllib.parse.unquote(url)
    parts = urllib.parse.urlsplit(url)
    if any(part == '' for part in [parts.scheme, parts.netloc]):
        raise ValueError('Not a valid URL')
    scheme = parts.scheme
    root = parts.netloc
    (folder, filename) = os.path.split(parts.path)
    while folder.startswith('/'):
        folder = folder[1:]

    # Folders are allowed to have slashes...
    folder = filepath_sanitize(folder, allowed='/\\')
    folder = folder.replace('\\', os.path.sep)
    folder = folder.replace('/', os.path.sep)
    # ...but Files are not.
    filename = filepath_sanitize(filename)

    result = {
        'scheme': scheme,
        'domain': urllib.parse.unquote(root),
        'folder': urllib.parse.unquote(folder),
        'filename': urllib.parse.unquote(filename),
    }
    return result
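
# Illustrative sketch, not part of the original module. This is a stripped-down
# view of what `url_split` extracts, using only the standard library. It skips
# the character sanitizing and the os.path.sep conversion, so it is a reading
# aid rather than a replacement. `_example_url_parts` is a hypothetical name.
import posixpath
import urllib.parse

def _example_url_parts(url):
    parts = urllib.parse.urlsplit(url)
    # URL paths always use forward slashes, so posixpath splits them reliably.
    (folder, filename) = posixpath.split(parts.path)
    return {
        'scheme': parts.scheme,
        'domain': parts.netloc,
        'folder': urllib.parse.unquote(folder.lstrip('/')),
        'filename': urllib.parse.unquote(filename),
    }
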

def write(line, file_handle=None, **kwargs):
    if file_handle is None:
        safeprint(line, **kwargs)
    else:
        file_handle.write(line + '\n', **kwargs)
##                                                                                                ##
## GENERAL FUNCTIONS ###############################################################################

## COMMANDLINE FUNCTIONS ###########################################################################
##                                                                                                ##
def digest(root_url, databasename=None, fullscan=False, threads=1):
    if root_url in ('!clipboard', '!c'):
        root_url = get_clipboard()
        write('From clipboard: %s' % root_url)
    walker = Walker(
        databasename=databasename,
        fullscan=fullscan,
        root_url=root_url,
        threads=threads,
    )
    walker.walk()

def digest_argparse(args):
    return digest(
        databasename=args.databasename,
        fullscan=args.fullscan,
        root_url=args.root_url,
        threads=int(args.threads),
    )

def download(
        databasename,
        outputdir=None,
        bytespersecond=None,
        headers=None,
        overwrite=False,
    ):
    '''
    Download all of the Enabled files. The filepaths will match that of the
    website, using `outputdir` as the root directory.

    Parameters:
        outputdir:
            The directory to mirror the files into. If not provided, the domain
            name is used.

        bytespersecond:
            The speed to ratelimit the downloads. Can be an integer, or a string like
            '500k', according to the capabilities of `bytestring.parsebytes`

            Note that this is bytes, not bits.

        headers:
            Additional headers to pass to each `download_file` call.

        overwrite:
            If True, delete local copies of existing files and rewrite them.
            Otherwise, completed files are skipped.
    '''
    sql = sqlite3.connect(databasename)
    cur = sql.cursor()

    if outputdir in (None, ''):
        # This assumes that all URLs in the database are from the same domain.
        # If they aren't, it's the user's fault because Walkers don't leave the given site
        # on their own.
        cur.execute('SELECT url FROM urls LIMIT 1')
        url = cur.fetchone()[0]
        outputdir = url_split(url)['domain']

    if isinstance(bytespersecond, str):
        bytespersecond = bytestring.parsebytes(bytespersecond)

    cur.execute('SELECT * FROM urls WHERE do_download == 1 ORDER BY url')
    for fetch in fetch_generator(cur):
        url = fetch[SQL_URL]

        url_filepath = url_split(url)
        folder = os.path.join(outputdir, url_filepath['folder'])
        os.makedirs(folder, exist_ok=True)

        fullname = os.path.join(folder, url_filepath['filename'])

        write('Downloading "%s"' % fullname)
        downloady.download_file(
            url,
            localname=fullname,
            bytespersecond=bytespersecond,
            callback_progress=downloady.Progress2,
            headers=headers,
            overwrite=overwrite,
        )

def download_argparse(args):
    return download(
        databasename=args.databasename,
        outputdir=args.outputdir,
        overwrite=args.overwrite,
        bytespersecond=args.bytespersecond,
    )

def filter_pattern(databasename, regex, action='keep'):
    '''
    When `action` is 'keep', then any URLs matching the regex will have their
    `do_download` flag set to True.

    When `action` is 'remove', then any URLs matching the regex will have their
    `do_download` flag set to False.

    Actions will not act on each other's behalf. Keep will NEVER disable a url,
    and remove will NEVER enable one.
    '''
    import re
    if isinstance(regex, str):
        regex = [regex]

    keep = action == 'keep'
    remove = action == 'remove'

    sql = sqlite3.connect(databasename)
    cur = sql.cursor()

    cur.execute('SELECT * FROM urls')
    items = cur.fetchall()
    for item in items:
        url = item[SQL_URL]
        for pattern in regex:
            contains = re.search(pattern, url) is not None

            if keep and contains and not item[SQL_DO_DOWNLOAD]:
                write('Enabling "%s"' % url)
                cur.execute('UPDATE urls SET do_download = 1 WHERE url == ?', [url])
            if remove and contains and item[SQL_DO_DOWNLOAD]:
                write('Disabling "%s"' % url)
                cur.execute('UPDATE urls SET do_download = 0 WHERE url == ?', [url])
    sql.commit()
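
# Illustrative sketch, not part of the original module. `filter_pattern` runs
# `re.search` against the URL exactly as stored, which is the percent-encoded
# form. A pattern written with a literal space therefore finds nothing, while
# "%20" matches. The URLs and `_example_matching_urls` are hypothetical.
import re

def _example_matching_urls(urls, pattern):
    # re.search matches anywhere in the string, so no leading ".*" is needed.
    return [url for url in urls if re.search(pattern, url) is not None]

_EXAMPLE_URLS = [
    'http://website.com/Daft%20Punk/folder.jpg',
    'http://website.com/Daft%20Punk/01.mp3',
    'http://website.com/Other/01.mp3',
]
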

def keep_pattern_argparse(args):
    '''
    See `filter_pattern`.
    '''
    return filter_pattern(
        action='keep',
        databasename=args.databasename,
        regex=args.regex,
    )

def list_basenames(databasename, output_filename=None):
    '''
    Print the Enabled entries in order of the file basenames.
    This makes it easier to find interesting titles without worrying about
    what directory they're in.
    '''
    sql = sqlite3.connect(databasename)
    cur = sql.cursor()

    cur.execute('SELECT * FROM urls WHERE do_download == 1')
    items = cur.fetchall()
    # `default=0` keeps this from raising ValueError when nothing is Enabled.
    longest = max((len(item[SQL_BASENAME]) for item in items), default=0)
    items.sort(key=lambda x: x[SQL_BASENAME].lower())

    if output_filename is not None:
        output_file = open(output_filename, 'w', encoding='utf-8')
    else:
        output_file = None

    form = '{basename:<%ds} : {url} : {size}' % longest
    for item in items:
        size = item[SQL_CONTENT_LENGTH]
        if size is None:
            size = ''
        else:
            size = bytestring.bytestring(size)
        line = form.format(
            basename=item[SQL_BASENAME],
            url=item[SQL_URL],
            size=size,
        )
        write(line, output_file)

    if output_file:
        output_file.close()

def list_basenames_argparse(args):
    return list_basenames(
        databasename=args.databasename,
        output_filename=args.outputfile,
    )

def list_urls(databasename, output_filename=None):
    '''
    Print the Enabled entries in alphabetical order of their URLs,
    one URL per line with no other formatting.
    '''
    sql = sqlite3.connect(databasename)
    cur = sql.cursor()

    cur.execute('SELECT * FROM urls WHERE do_download == 1')
    items = cur.fetchall()
    items.sort(key=lambda x: x[SQL_URL].lower())

    if output_filename is not None:
        output_file = open(output_filename, 'w', encoding='utf-8')
    else:
        output_file = None

    for item in items:
        write(item[SQL_URL], output_file)

    if output_file:
        output_file.close()

def list_urls_argparse(args):
    return list_urls(
        databasename=args.databasename,
        output_filename=args.outputfile,
    )


def measure(databasename, fullscan=False, new_only=False, threads=4):
    '''
    Given a database, print the sum of all Content-Lengths.

    URLs will be HEAD requested if:
        `new_only` is True and the file has no stored content length, or
        `fullscan` is True and `new_only` is False
    '''
    if isinstance(fullscan, str):
        fullscan = bool(fullscan)

    sql = sqlite3.connect(databasename)
    cur = sql.cursor()

    if new_only:
        cur.execute('SELECT * FROM urls WHERE do_download == 1 AND content_length IS NULL')
    else:
        cur.execute('SELECT * FROM urls WHERE do_download == 1')

    items = cur.fetchall()
    filecount = len(items)
    totalsize = 0
    unmeasured_file_count = 0

    if threads is None:
        threads = 1

    thread_queue = threadqueue.ThreadQueue(threads)
    try:
        for fetch in items:
            size = fetch[SQL_CONTENT_LENGTH]

            if fullscan or new_only:
                url = fetch[SQL_URL]
                thread_queue.add(do_head, url, raise_for_status=False)

            elif size is None:
                # Unmeasured and no intention to measure.
                unmeasured_file_count += 1

            else:
                totalsize += size

        for head in thread_queue.run():
            fetch = smart_insert(sql, cur, head=head, commit=False)
            size = fetch[SQL_CONTENT_LENGTH]
            if size is None:
                # Report the URL of this response's row, not the last URL
                # queued by the loop above.
                write('"%s" is not revealing Content-Length' % fetch[SQL_URL])
                size = 0
            totalsize += size
    except (Exception, KeyboardInterrupt):
        # Keep whatever was measured before the interruption.
        sql.commit()
        raise

    sql.commit()
    size_string = bytestring.bytestring(totalsize)
    totalsize_string = '{size_short} ({size_exact:,} bytes) in {filecount:,} files'
    totalsize_string = totalsize_string.format(
        size_short=size_string,
        size_exact=totalsize,
        filecount=filecount,
    )
    write(totalsize_string)
    if unmeasured_file_count > 0:
        write(UNMEASURED_WARNING % unmeasured_file_count)
    return totalsize
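
# Illustrative sketch only, not part of the original module: the HEAD-request
# rule from the `measure` docstring written as a standalone predicate.
# `_example_should_head` and its parameters are hypothetical names. In
# `measure` itself the `new_only` case is already narrowed by the SQL query,
# which is why the loop only needs to check `fullscan or new_only`.
def _example_should_head(stored_length, fullscan=False, new_only=False):
    if new_only:
        # Only files with no stored Content-Length are requested again.
        return stored_length is None
    # Without new_only, every enabled file is requested when fullscan is set.
    return bool(fullscan)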

def measure_argparse(args):
    return measure(
        databasename=args.databasename,
        fullscan=args.fullscan,
        new_only=args.new_only,
        threads=int_none(args.threads),
    )

def remove_pattern_argparse(args):
    '''
    See `filter_pattern`.
    '''
    return filter_pattern(
        action='remove',
        databasename=args.databasename,
        regex=args.regex,
    )

def tree(databasename, output_filename=None):
    '''
    Print a tree diagram of the directory-file structure.

    If an .html file is given for `output_filename`, the page will have
    collapsible boxes and clickable filenames. Otherwise the file will just
    be a plain text drawing.
    '''
    tree_root = build_file_tree(databasename)
    tree_root.path = None
    for node in tree_root.walk():
        if node.path:
            node.path = node.path.replace(':||', '://')
            node.display_name = node.display_name.replace(':||', '://')

    if output_filename is not None:
        output_file = open(output_filename, 'w', encoding='utf-8')
        use_html = output_filename.lower().endswith('.html')
    else:
        output_file = None
        use_html = False

    size_details = pathtree.recursive_get_size(tree_root)

    if size_details['unmeasured'] > 0:
        footer = UNMEASURED_WARNING % size_details['unmeasured']
    else:
        footer = None

    line_generator = pathtree.recursive_print_node(tree_root, use_html=use_html, footer=footer)
    for line in line_generator:
        write(line, output_file)

    if output_file is not None:
        output_file.close()
    return tree_root

def tree_argparse(args):
    return tree(
        databasename=args.databasename,
        output_filename=args.outputfile,
    )
## ##
## COMMANDLINE FUNCTIONS ###########################################################################

def main(argv):
    if safeindex(argv, 1, '').lower() in ('help', '-h', '--help', ''):
        write(__doc__)
        return

    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers()

    p_digest = subparsers.add_parser('digest')
    p_digest.add_argument('root_url')
    p_digest.add_argument('-db', '--database', dest='databasename', default=None)
    p_digest.add_argument('-f', '--fullscan', dest='fullscan', action='store_true')
    p_digest.add_argument('-t', '--threads', dest='threads', default=1)
    p_digest.set_defaults(func=digest_argparse)

    p_download = subparsers.add_parser('download')
    p_download.add_argument('databasename')
    p_download.add_argument('-o', '--outputdir', dest='outputdir', default=None)
    p_download.add_argument('-bps', '--bytespersecond', dest='bytespersecond', default=None)
    p_download.add_argument('-ow', '--overwrite', dest='overwrite', action='store_true')
    p_download.set_defaults(func=download_argparse)

    p_keep_pattern = subparsers.add_parser('keep_pattern')
    p_keep_pattern.add_argument('databasename')
    p_keep_pattern.add_argument('regex')
    p_keep_pattern.set_defaults(func=keep_pattern_argparse)

    p_list_basenames = subparsers.add_parser('list_basenames')
    p_list_basenames.add_argument('databasename')
    p_list_basenames.add_argument('-o', '--outputfile', dest='outputfile', default=None)
    p_list_basenames.set_defaults(func=list_basenames_argparse)

    p_list_urls = subparsers.add_parser('list_urls')
    p_list_urls.add_argument('databasename')
    p_list_urls.add_argument('-o', '--outputfile', dest='outputfile', default=None)
    p_list_urls.set_defaults(func=list_urls_argparse)

    p_measure = subparsers.add_parser('measure')
    p_measure.add_argument('databasename')
    p_measure.add_argument('-f', '--fullscan', dest='fullscan', action='store_true')
    p_measure.add_argument('-n', '--new_only', dest='new_only', action='store_true')
    p_measure.add_argument('-t', '--threads', dest='threads', default=1)
    p_measure.set_defaults(func=measure_argparse)

    p_remove_pattern = subparsers.add_parser('remove_pattern')
    p_remove_pattern.add_argument('databasename')
    p_remove_pattern.add_argument('regex')
    p_remove_pattern.set_defaults(func=remove_pattern_argparse)

    p_tree = subparsers.add_parser('tree')
    p_tree.add_argument('databasename')
    p_tree.add_argument('-o', '--outputfile', dest='outputfile', default=None)
    p_tree.set_defaults(func=tree_argparse)

    # Allow interchangeability of the command and database name:
    # opendirdl measure test.db -n = opendirdl test.db measure -n
    if argv[0] != 'digest' and os.path.isfile(argv[0]):
        (argv[0], argv[1]) = (argv[1], argv[0])
    #print(argv)

    args = parser.parse_args(argv)
    args.func(args)
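
# Illustrative sketch only, not part of the original module: the
# command/database swap performed in `main`, pulled out as a pure function so
# it can be tried in isolation. `_example_swap_command_and_database` and the
# injectable `isfile` parameter are hypothetical. Unlike `main`, this version
# assumes a length guard is wanted, so a lone database filename is returned
# unchanged instead of raising IndexError.
import os

def _example_swap_command_and_database(argv, isfile=os.path.isfile):
    argv = list(argv)  # Work on a copy; the caller's list is left untouched.
    if len(argv) >= 2 and argv[0] != 'digest' and isfile(argv[0]):
        # "test.db measure -n" becomes "measure test.db -n".
        argv[0], argv[1] = argv[1], argv[0]
    return argv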

if __name__ == '__main__':
    main(sys.argv[1:])