import argparse
import bs4
import datetime
import html
import logging
import requests
import sqlite3
import sys
import time

from voussoirkit import backoff
from voussoirkit import betterhelp
from voussoirkit import httperrors
from voussoirkit import mutables
from voussoirkit import operatornotify
from voussoirkit import ratelimiter
from voussoirkit import sqlhelpers
from voussoirkit import threadpool
from voussoirkit import treeclass
from voussoirkit import vlogging

log = vlogging.getLogger(__name__, 'hnarchive')

VERSION = '1.0.0'

HEADERS = {
    'User-Agent': f'voussoir/hnarchive v{VERSION}.',
}

session = requests.Session()
session.headers.update(HEADERS)

DB_INIT = '''
BEGIN;
PRAGMA user_version = 1;
CREATE TABLE IF NOT EXISTS items(
    id INT PRIMARY KEY NOT NULL,
    deleted INT,
    type TEXT,
    author TEXT,
    time INT,
    text TEXT,
    dead INT,
    parent TEXT,
    poll TEXT,
    url TEXT,
    score INT,
    title TEXT,
    descendants INT,
    retrieved INT
);
CREATE INDEX IF NOT EXISTS index_items_id on items(id);
CREATE INDEX IF NOT EXISTS index_items_parent on items(parent);
CREATE INDEX IF NOT EXISTS index_items_poll on items(poll) WHERE poll IS NOT NULL;
CREATE INDEX IF NOT EXISTS index_items_time on items(time);
CREATE INDEX IF NOT EXISTS index_items_type_time on items(type, time);
CREATE INDEX IF NOT EXISTS index_items_age_at_retrieval on items(retrieved - time);
COMMIT;
'''
COLUMNS = sqlhelpers.extract_table_column_map(DB_INIT)
ITEMS_COLUMNS = COLUMNS['items']

sql = sqlite3.connect('hnarchive.db')
sqlhelpers.executescript(sql, DB_INIT)

# HELPERS ##########################################################################################

def ctrlc_commit(function):
    def wrapped(*args, **kwargs):
        try:
            return function(*args, **kwargs)
        except KeyboardInterrupt:
            commit()
            return 1
    return wrapped

def int_or_none(x):
    if x is None:
        return x
    return int(x)

# API ##############################################################################################

def get(url, retries=1):
    bo = backoff.Quadratic(a=0.2, b=0, c=1, max=10)
    while retries > 0:
        try:
            log.loud(url)
            response = session.get(url, timeout=2)
            httperrors.raise_for_status(response)
            return response
        except (
                httperrors.HTTP429,
                httperrors.HTTP5XX,
                requests.exceptions.ConnectionError,
                requests.exceptions.ReadTimeout,
            ):
            # Any other 4XX should raise.
            retries -= 1
            log.loud('Request failed, %d tries remain.', retries)
            time.sleep(bo.next())

    raise RuntimeError(f'Ran out of retries on {url}.')

def get_item(id):
    url = f'https://hacker-news.firebaseio.com/v0/item/{id}.json'
    response = get(url, retries=8)
    item = response.json()
    if item is None:
        return None
    if 'time' not in item:
        # For example, 78692 from the api shows {"id": 78692, "type": "story"},
        # but the web says "No such item."
        # https://hacker-news.firebaseio.com/v0/item/78692.json
        # https://news.ycombinator.com/item?id=78692
        return None
    return item

def get_items(ids, threads=None):
    if threads:
        return get_items_multithreaded(ids, threads)
    else:
        return get_items_singlethreaded(ids)

def get_items_multithreaded(ids, threads):
    pool = threadpool.ThreadPool(threads, paused=True)
    job_gen = ({'function': get_item, 'kwargs': {'id': id}} for id in ids)
    pool.add_generator(job_gen)

    for job in pool.result_generator(buffer_size=250):
        if job.exception:
            raise job.exception
        if job.value is not None:
            yield job.value

def get_items_singlethreaded(ids):
    for id in ids:
        item = get_item(id)
        if item is not None:
            yield item

def get_latest_id():
    url = 'https://hacker-news.firebaseio.com/v0/maxitem.json'
    response = get(url)
    latest_id = int(response.text)
    return latest_id

def livestream():
    bo = backoff.Linear(m=2, b=5, max=60)
    id = select_latest_id() or 1
    # missed_loops:
    # Usually, livestream assumes that `item is None` means the requested item
    # id hasn't been published yet. But, if that item is actually just deleted,
    # we would be stuck waiting for it forever.
    # missed_loops is used to occasionally check get_latest_id to see if new
    # items are available, so we know that the current id is really just
    # deleted.
    # Items are released in small batches of < ~10 at a time. It is important
    # that the number in `latest > id+XXX` is big enough that we are sure the
    # requested item is really dead and not just part of a fresh batch that
    # beat our check in a race condition (consider that between the last
    # iteration which triggered the check and the call to get_latest_id, the
    # item we were waiting for is published in a new batch). I chose 50 because
    # catching up with 50 items is not a big deal.
    missed_loops = 0
    while True:
        item = get_item(id)
        if item is None:
            log.debug('%s does not exist yet.', id)
            missed_loops += 1
            if missed_loops % 5 == 0:
                latest = get_latest_id()
                if latest > (id+50):
                    log.debug('Skipping %s because future ids exist.', id)
                    id += 1
                    continue
            time.sleep(bo.next())
            continue
        id += 1
        missed_loops = 0
        bo.rewind(2)
        yield item

# DATABASE #########################################################################################

def commit():
    log.info('Committing.')
    sql.commit()

def insert_item(data):
    id = data['id']
    retrieved = int(time.time())

    existing = select_item(id)
    if existing is None:
        row = {
            'id': id,
            'deleted': bool(data.get('deleted', False)),
            'type': data['type'],
            'author': data.get('by', None),
            'time': int(data['time']),
            'text': data.get('text', None),
            'dead': bool(data.get('dead', False)),
            'parent': data.get('parent', None),
            'poll': data.get('poll', None),
            'url': data.get('url', None),
            'score': int_or_none(data.get('score', None)),
            'title': data.get('title', None),
            'descendants': int_or_none(data.get('descendants', None)),
            'retrieved': retrieved,
        }
        log.info('Inserting item %s.', id)
        (qmarks, bindings) = sqlhelpers.insert_filler(ITEMS_COLUMNS, row, require_all=True)
        query = f'INSERT INTO items VALUES({qmarks})'
        sql.execute(query, bindings)
        log.loud('Inserted item %s.', id)
    else:
        row = {
            'id': id,
            'deleted': bool(data.get('deleted', False)),
            'type': data['type'],
            'author': data.get('by', existing.get('author', None)),
            'time': int(data['time']),
            'text': data.get('text', existing.get('text', None)),
            'dead': bool(data.get('dead', False)),
            'parent': data.get('parent', None),
            'poll': data.get('poll', existing.get('poll', None)),
            'url': data.get('url', existing.get('url', None)),
            'score': int_or_none(data.get('score', existing.get('score', None))),
            'title': data.get('title', existing.get('title', None)),
            'descendants': int_or_none(data.get('descendants', None)),
            'retrieved': retrieved,
        }
        log.info('Updating item %s.', id)
        (qmarks, bindings) = sqlhelpers.update_filler(row, where_key='id')
        query = f'UPDATE items {qmarks}'
        sql.execute(query, bindings)
        log.loud('Updated item %s.', id)

    return {'row': row, 'is_new': existing is None}

def insert_items(items, commit_period=200):
    ticker = 0
    for item in items:
        insert_item(item)
        ticker = (ticker + 1) % commit_period
        if ticker == 0:
            commit()
    commit()

def select_child_items(id):
    '''
    Return items whose parent is this id.
    '''
    cur = sql.execute('SELECT * FROM items WHERE parent == ?', [id])
    rows = cur.fetchall()

    items = [dict(zip(ITEMS_COLUMNS, row)) for row in rows]
    return items

def select_poll_options(id):
    '''
    Return items that are pollopts under this given poll id.
    '''
    cur = sql.execute('SELECT * FROM items WHERE poll == ?', [id])
    rows = cur.fetchall()

    items = [dict(zip(ITEMS_COLUMNS, row)) for row in rows]
    return items

def select_item(id):
    cur = sql.execute('SELECT * FROM items WHERE id == ?', [id])
    row = cur.fetchone()

    if row is None:
        return None

    item = dict(zip(ITEMS_COLUMNS, row))
    return item

def select_latest_id():
    cur = sql.execute('SELECT id FROM items ORDER BY id DESC LIMIT 1')
    row = cur.fetchone()
    if row is None:
        return None
    return row[0]

# RENDERING ########################################################################################

def _fix_ptags(text):
    '''
    The text returned by HN only puts
    <p> in between paragraphs, they do not add closing tags or put an opening
    <p> on the first paragraph.

    If the user typed a literal <p> then it will have been stored with &lt;
    and &gt; so it won't get messed up here.
    '''
    text = text.replace('<p>', '</p><p>')
    text = '<p>' + text + '</p>'
    return text

def build_item_tree(*, id=None, item=None):
    if id is not None and item is None:
        item = select_item(id)
        if item is None:
            raise ValueError('We dont have that item in the database.')
    elif item is not None and id is None:
        id = item['id']
    else:
        raise TypeError('Please pass only one of id, item.')

    tree = treeclass.Tree(str(id), data=item)
    for child in select_child_items(id):
        tree.add_child(build_item_tree(item=child))
    return tree

def html_render_comment(*, soup, item):
    div = soup.new_tag('div')
    div['class'] = item['type']
    div['id'] = item['id']

    userinfo = soup.new_tag('p')
    div.append(userinfo)

    author = item['author'] or '[deleted]'
    username = soup.new_tag('a', href=f'https://news.ycombinator.com/user?id={author}')
    username.append(author)
    userinfo.append(username)

    userinfo.append(' | ')

    date = datetime.datetime.utcfromtimestamp(item['time'])
    date = date.strftime('%Y %b %d %H:%M:%S')
    timestamp = soup.new_tag('a', href=f'https://news.ycombinator.com/item?id={item["id"]}')
    timestamp.append(date)
    userinfo.append(timestamp)

    text = item['text'] or '[deleted]'
    text = bs4.BeautifulSoup(_fix_ptags(text), 'html.parser')
    div.append(text)
    return div

def html_render_comment_tree(*, soup, tree):
    div = html_render_comment(soup=soup, item=tree.data)
    for child in tree.list_children(sort=lambda node: node.data['time']):
        div.append(html_render_comment_tree(soup=soup, tree=child))
    return div

def html_render_job(*, soup, item):
    div = soup.new_tag('div')
    div['class'] = item['type']
    div['id'] = item['id']

    h = soup.new_tag('h1')
    div.append(h)
    h.append(item['title'])

    if item['text']:
        text = bs4.BeautifulSoup(_fix_ptags(item['text']), 'html.parser')
        div.append(text)
    return div

def html_render_poll(*, soup, item):
    options = select_poll_options(item['id'])
    div = html_render_story(soup=soup, item=item)
    for option in options:
        div.append(html_render_pollopt(soup=soup, item=option))
    return div

def html_render_pollopt(*, soup, item):
    div = soup.new_tag('div')
    div['class'] = item['type']
    text = bs4.BeautifulSoup(_fix_ptags(item['text']), 'html.parser')
    div.append(text)

    points = soup.new_tag('p')
    points.append(f'{item["score"]} points')
    div.append(points)
    return div

def html_render_story(*, soup, item):
    div = soup.new_tag('div')
    div['class'] = item['type']
    div['id'] = item['id']

    h = soup.new_tag('h1')
    div.append(h)
    if item['url']:
        a = soup.new_tag('a', href=item['url'])
        a.append(item['title'])
        h.append(a)
    else:
        h.append(item['title'])

    if item['text']:
        text = bs4.BeautifulSoup(_fix_ptags(item['text']), 'html.parser')
        div.append(text)

    userinfo = soup.new_tag('p')
    div.append(userinfo)

    author = item['author']
    username = soup.new_tag('a', href=f'https://news.ycombinator.com/user?id={author}')
    username.append(author)
    userinfo.append(username)

    userinfo.append(' | ')

    date = datetime.datetime.utcfromtimestamp(item['time'])
    date = date.strftime('%Y %b %d %H:%M:%S')
    timestamp = soup.new_tag('a', href=f'https://news.ycombinator.com/item?id={item["id"]}')
    timestamp.append(date)
    userinfo.append(timestamp)

    userinfo.append(' | ')

    points = soup.new_tag('span')
    points.append(f'{item["score"]} points')
    userinfo.append(points)

    return div

def html_render_page(tree):
    soup = bs4.BeautifulSoup()
    html = soup.new_tag('html')
    soup.append(html)

    head = soup.new_tag('head')
    html.append(head)

    style = soup.new_tag('style')
    style.append('''
    .comment,
    .job,
    .poll,
    .pollopt,
    .story
    {
        padding-left: 20px;
        margin-top: 4px;
        margin-right: 4px;
        margin-bottom: 4px;
    }
    .job, .poll, .story
    {
        border: 2px solid blue;
    }
    body > .story + .comment,
    body > .comment + .comment
    {
        margin-top: 10px;
    }
    .comment, .pollopt
    {
        border: 1px solid black;
    }
    ''')
    head.append(style)

    body = soup.new_tag('body')
    html.append(body)

    item = tree.data

    if item['type'] == 'comment':
        body.append(html_render_comment_tree(soup=soup, tree=tree))

    elif item['type'] == 'job':
        body.append(html_render_job(soup=soup, item=item))

    elif item['type'] == 'poll':
        body.append(html_render_poll(soup=soup, item=item))
        for child in tree.list_children(sort=lambda node: node.data['time']):
            body.append(html_render_comment_tree(soup=soup, tree=child))

    elif item['type'] == 'story':
        body.append(html_render_story(soup=soup, item=item))
        for child in tree.list_children(sort=lambda node: node.data['time']):
            body.append(html_render_comment_tree(soup=soup, tree=child))

    return soup

# COMMAND LINE #####################################################################################

DOCSTRING = '''
hnarchive.py
============

{get}

{html_render}

{livestream}

{update}

{update_items}

TO SEE DETAILS ON EACH COMMAND, RUN
> hnarchive.py