Jul 15, 2018
A while back I made a Python script for downloading stuff from yiff.party, and I thought it might be useful for others as well. I am attaching the file as .txt since the .py extension is not allowed.
Requirements:
- Python 3
- requests
- beautifulsoup4
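Both libraries are on PyPI, so if they are not already installed, something like this should take care of it:
pip install requests beautifulsoup4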
usage: yiff-dl.py [-h] [-n CONNECTIONS] [-p PAGE] [-t POST] [-i] [-d {pre,post}] [-o OUTPUT_DIR]
                  [--downloader DOWNLOADER] [--downloader-args DOWNLOADER_ARGS]
                  url

positional arguments:
  url

optional arguments:
  -h, --help            show this help message and exit
  -n CONNECTIONS, --connections CONNECTIONS
                        the maximum number of files to download concurrently
  -p PAGE, --page PAGE  download only specified pages (n for just page n, :n for upto page n, n: for n to end, n1:n2
                        for n1 to n2)
  -t POST, --post POST  download only specified posts (n for just post n, :n for upto post n, n: for n to end, n1:n2
                        for n1 to n2)
  -i, --write_info      write info htmls
  -d {pre,post}, --id_format {pre,post}
                        prefix or postfix id
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        output directory
  --downloader DOWNLOADER
                        specify external downloader
  --downloader-args DOWNLOADER_ARGS
                        specify external downloader arguments
-n, --connections essentially creates n processes using Python's multiprocessing library and starts downloading one file in each process. Don't set this too high; 4-6 is a good range.
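For the curious, this is roughly the pattern the script uses internally: a multiprocessing Pool of n workers sharing a counter through an initializer. The snippet below is only a minimal sketch (download_one and the example URLs are placeholders, not part of the script):

from multiprocessing import Pool, Value

counter = None

def init(c):
    global counter
    counter = c

def download_one(url):
    # placeholder: fetch url, write it to disk, then bump the shared counter
    with counter.get_lock():
        counter.value += 1

if __name__ == '__main__':
    urls = ['https://example.com/a.png', 'https://example.com/b.png']  # placeholders
    counter = Value('i', 0)
    with Pool(4, initializer=init, initargs=(counter,)) as pool:  # 4 here plays the role of -n
        pool.map(download_one, urls)
    print(f'{counter.value} files downloaded')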
To download a single page or a range of pages, use -p, --page. For example, if a creator has 10 pages on yiff.party, then -p 3 downloads only page 3, -p 2:5 downloads pages 2,3,4,5, -p :5 downloads pages 1,2,3,4,5 and -p 6: downloads pages 6,7,8,9,10.
Use -t, --post to download specific posts from a single page. Best used along with -p n.
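A couple of example invocations combining the two (same placeholder URL as in the commands further down):
python yiff-dl.py -p 2:5 "<url>"          (pages 2 through 5)
python yiff-dl.py -p 3 -t 1:4 "<url>"     (posts 1 through 4 of page 3)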
Use -i to write out local html files so that you have a local working copy which can be viewed in a browser.
By default the files are prefixed with the post id. Use -d post, --id_format post to postfix it instead.
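For example, for a hypothetical post id 12345 and a file called image.png, the two formats come out as:
pre (default): 12345_image.png
post (-d post): image_12345.png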
You can use external downloaders like aria2 or axel to download the files; otherwise Python's requests library is used. An external downloader is recommended since they support resuming and skipping already existing downloads. Currently only downloaders that accept a -o <output file> flag can be used, so curl, axel and aria2 work, but wget does not (its output flag is -O; -o writes a log file).
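Internally the external downloader command is assembled roughly like this (see downloadfile in the script below):
call_list = [args.downloader] + args.downloader_args.split() + ['-o', file_name, url]
subprocess.call(call_list)
so with --downloader axel --downloader-args " -n 10" the actual call ends up being something like axel -n 10 -o <output file> <url>.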
My preferred usage is:
python yiff-dl.py -n 5 -d post -i --downloader axel --downloader-args " -n 10" "<url>"
OR, with aria2
python yiff-dl.py -n 5 -d post -i --downloader aria2c --downloader-args "-x 10" "<url>"
Python:
import argparse
import os
import requests
import subprocess
from multiprocessing import Pool, Value

from bs4 import BeautifulSoup

# Global counter for downloads
counter = None


def init(c):
    """Initialize global counter.
    """
    global counter
    counter = c
def write_info(info, page, pages, output_dir, patreon_title):
    """Write html info file.

    Parameters
    ----------
    info : dict
        Dict with all the necessary info.
    page : int
        Page number for which html is to be written.
    pages : int
        Total number of pages.
    output_dir : str
        The output directory.
    patreon_title : str
        The patreon title.
    """
    # Header
    html = '<!DOCTYPE html><html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"><meta http-equiv="content-language" content="en"></head><body style="margin:auto;width:50%;background-color:#eeeeee;color:#424242;font-family:\'Segoe UI\', Tahoma, Geneva, Verdana, sans-serif;text-align: justify"><br>'
    # Top page links
    if pages > 1:
        html += '<div style="word-spacing: 15px">'
        for p in range(1, pages+1):
            html += f'<a href="info{p}.html">{p}</a> ' if p != page else f'{p} '
        html += '</div><hr style="border: dashed 2px"/>'
    # Write info
    for i in info:
        html += f'<h3>{i["title"]}</h3>'
        if 'thumb' in i:
            html += f'<img src="{i["thumb"]}" height=200><br><br>'
        html += f'<b>id</b>: <i>{i["id"]}</i>  <b>{i["time"]}</b>'
        if 'post' in i:
            html += f'    <a href="{i["post"]}"><b>Post File</b></a>'
        if 'incomplete' in i:
            html += f'    <span style="color:#c62828"><b>INCOMPLETE</b></span>'
        html += '<br>'
        if i['body'] != '':
            html += f'<br><div>{i["body"]}</div>'
        if 'embed' in i:
            html += f'<br><a href="{i["embed"]}"><b>EMBED URL</b></a><br>'
        if 'attachments' in i and len(i['attachments']) > 0:
            html += '<br><b>Attachments</b><br>'
            for att in i['attachments']:
                html += f'<a href="{att[0]}">{att[1]}</a><br>'
        html += '<br><hr style="border: dashed 2px"/>'
    # Bottom page links
    if pages > 1:
        html += '<div style="word-spacing: 15px">'
        for p in range(1, pages+1):
            html += f'<a href="info{p}.html">{p}</a> ' if p != page else f'{p} '
    html += '</div><br></body></html>'
    # Write out to file
    with open(os.path.join(output_dir, patreon_title, f'info{page}.html'), 'w', encoding='utf8') as f:
        f.write(html)
def downloadfile(all_args):
    """Download a single file using whichever downloader is specified.

    Parameters
    ----------
    all_args : tuple
        all arguments as a tuple (script_args, url, file_name).
    """
    global counter
    # Extract arguments
    args, url, file_name = all_args
    # Fix url
    if not url.startswith('http'):
        url = 'https://yiff.party' + url
    # Download file
    try:
        # Use external downloader if specified else use requests lib
        if args.downloader is None:
            file = requests.get(url)
            with open(file_name, 'wb') as f:
                f.write(file.content)
        else:
            call_list = [args.downloader]
            if args.downloader_args is not None:
                call_list += args.downloader_args.split()
            call_list += ['-o', file_name, url]
            subprocess.call(call_list, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        # Update counter
        with counter.get_lock():
            counter.value += 1
            print(f'\r{counter.value} files downloaded', end='')
    except (requests.RequestException, FileNotFoundError, OSError):
        pass
def downloadpost(args, post, patreon_title):
    """Download a single post.

    Parameters
    ----------
    args : dict
        Script arguments.
    post : dict
        Post to download.
    patreon_title : str
        Patreon title.

    Returns
    -------
    info : dict
        info about the post.
    """
    downloadables = []
    types = {}
    # Extract id, time and title
    id_ = post['id'][1:]
    time = post.select('small.post-time')[0].text[:10]
    title = post.select('span.card-title')[0].text
    if title.endswith('more_vert'):
        title = title[:-9]
    else:
        title = title[:-25]
    info = {'id': id_, 'time': time, 'title': title}
    print(title)
    # Extract thumbnail
    if len(post.select('div.card-image')) > 0:
        assert len(post.select('div.card-image')) == 1
        card_image = post.select('div.card-image')[0].select('img')[0]['data-src']
        bname = os.path.basename(card_image)
        if args.id_format == 'pre':
            card_image_local = os.path.join(args.output_dir, patreon_title, 'thumbs', f'{id_}_{bname}')
        else:
            card_image_local = os.path.join(args.output_dir, patreon_title, 'thumbs',
                                            f'{bname[:bname.rfind(".")]}_{id_}{bname[bname.rfind("."):]}' if bname.rfind('.') != -1 else f'{bname}_{id_}')
        downloadables.append((args, card_image, card_image_local))
        types['card image'] = 1
        info['thumb'] = os.path.join('thumbs', os.path.basename(card_image_local))
    # Extract post file
    if len(post.select('div.card-action')) > 0:
        assert len(post.select('div.card-action')) == 1
        post_file = post.select('div.card-action')[0].select('a')[0]['href']
        bname = os.path.basename(post_file)
        if args.id_format == 'pre':
            post_file_local = os.path.join(args.output_dir, patreon_title, 'posts', f'{id_}_{bname}')
        else:
            post_file_local = os.path.join(args.output_dir, patreon_title, 'posts',
                                           f'{bname[:bname.rfind(".")]}_{id_}{bname[bname.rfind("."):]}' if bname.rfind('.') != -1 else f'{bname}_{id_}')
        downloadables.append((args, post_file, post_file_local))
        types['post file'] = 1
        info['post'] = os.path.join('posts', os.path.basename(post_file_local))
    # Extract body
    assert len(post.select('div.post-body')) == 1
    post_body = post.select('div.post-body')[0]
    # Extract inlines
    for img in post_body.select('img.post-img-inline'):
        bname_i = os.path.basename(img.parent['href'])
        bname_t = os.path.basename(img['data-src'])
        if args.id_format == 'pre':
            img_local = os.path.join(args.output_dir, patreon_title, 'inlines', f'{id_}_{bname_i}')
            thumb_local = os.path.join(args.output_dir, patreon_title, 'thumbs', f'{id_}_inline_{bname_t}')
        else:
            img_local = os.path.join(args.output_dir, patreon_title, 'inlines',
                                     f'{bname_i[:bname_i.rfind(".")]}_{id_}{bname_i[bname_i.rfind("."):]}' if bname_i.rfind('.') != -1 else f'{bname_i}_{id_}')
            thumb_local = os.path.join(args.output_dir, patreon_title, 'thumbs',
                                       f'{bname_t[:bname_t.rfind(".")]}_{id_}{bname_t[bname_t.rfind("."):]}' if bname_t.rfind('.') != -1 else f'{bname_t}_{id_}_inline')
        downloadables.append((args, img.parent['href'], img_local))
        downloadables.append((args, img['data-src'], thumb_local))
        if 'inlines' not in types:
            types['inlines'] = 0
        types['inlines'] += 1
        if 'inline thumbs' not in types:
            types['inline thumbs'] = 0
        types['inline thumbs'] += 1
        img['src'] = os.path.join('thumbs', os.path.basename(thumb_local))
        img['height'] = 200
        img.parent['href'] = os.path.join('inlines', os.path.basename(img_local))
    info['body'] = post_body.decode_contents()
    if info['body'].startswith('<p>') and info['body'].endswith('</p>'):
        info['body'] = info['body'][3:-4]
    # Extract embedded link
    if len(post.select('div.card-embed')) > 0:
        assert len(post.select('div.card-embed')) == 1
        info['embed'] = post.select('div.card-embed')[0].select('a')[0]['href']
    # Extract attachments
    info['attachments'] = []
    for card_attachment in post.select('div.card-attachments'):
        attachments = card_attachment.select('a')
        for attachment in attachments:
            att_id = attachment['href'].split('/')[-2]
            bname = (attachment.text[:attachment.text.rfind('.')] + '_' + att_id + attachment.text[attachment.text.rfind('.'):]
                     if attachment.text.rfind('.') != -1 else attachment.text + '_' + att_id)
            if args.id_format == 'pre':
                attachment_local = os.path.join(args.output_dir, patreon_title, 'attachments', f'{id_}_{bname}')
            else:
                attachment_local = os.path.join(args.output_dir, patreon_title, 'attachments',
                                                f'{bname[:bname.rfind(".")]}_{id_}{bname[bname.rfind("."):]}' if bname.rfind('.') != -1 else f'{bname}_{id_}')
            downloadables.append((args, attachment['href'], attachment_local))
            if 'attachments' not in types:
                types['attachments'] = 0
            types['attachments'] += 1
            info['attachments'].append(
                (os.path.join('attachments', os.path.basename(attachment_local)), attachment.text))
    # Download files
    print(f'{len(downloadables)} files to download.', end='')
    print(f' ({", ".join([t+":"+str(types[t]) for t in types])})' if len(downloadables) > 0 else '')
    if len(downloadables) > 0:
        counter = Value('i', 0)
        with Pool(args.connections, initializer=init, initargs=(counter,)) as p:
            p.map(downloadfile, downloadables)
        if counter.value < len(downloadables):
            info['incomplete'] = True
    # Return info object
    return info
if __name__ == '__main__':
    """Main function.
    """
    # Define and parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('url')
    parser.add_argument('-n', '--connections',
                        help='the maximum number of files to download concurrently', type=int, default=1)
    parser.add_argument(
        '-p', '--page', help='download only specified pages (n for just page n, :n for upto page n, n: for n to end, n1:n2 for n1 to n2)')
    parser.add_argument(
        '-t', '--post', help='download only specified posts (n for just post n, :n for upto post n, n: for n to end, n1:n2 for n1 to n2)')
    parser.add_argument('-i', '--write_info', help='write info htmls', action='store_true')
    parser.add_argument('-d', '--id_format', help='prefix or postfix id', choices=['pre', 'post'], default='pre')
    parser.add_argument('-o', '--output_dir', help='output directory', default='./')
    parser.add_argument('--downloader', help='specify external downloader', default=None)
    parser.add_argument('--downloader-args', dest='downloader_args',
                        help='specify external downloader arguments', default=None)
    args = parser.parse_args()
    # Get first page
    print('Getting webpage')
    while True:
        page = requests.get(args.url)
        if page.status_code == 200:
            break
        print('Retrying')
    soup = BeautifulSoup(page.text, 'html.parser')
    # Get pages
    if len(soup.select('p.paginate-count')) > 0:
        pages = soup.select('p.paginate-count')[0].text
        pages = int(pages[pages.find('/')+2:])
        print(f'{pages} pages')
    else:
        pages = 1
    patreon_title = soup.select('title')[0].text[:-13]
    # Create directories
    if not os.path.exists(os.path.join(args.output_dir, patreon_title)):
        os.mkdir(os.path.join(args.output_dir, patreon_title))
    if not os.path.exists(os.path.join(args.output_dir, patreon_title, 'thumbs')):
        os.mkdir(os.path.join(args.output_dir, patreon_title, 'thumbs'))
    if not os.path.exists(os.path.join(args.output_dir, patreon_title, 'posts')):
        os.mkdir(os.path.join(args.output_dir, patreon_title, 'posts'))
    if not os.path.exists(os.path.join(args.output_dir, patreon_title, 'attachments')):
        os.mkdir(os.path.join(args.output_dir, patreon_title, 'attachments'))
    if not os.path.exists(os.path.join(args.output_dir, patreon_title, 'inlines')):
        os.mkdir(os.path.join(args.output_dir, patreon_title, 'inlines'))
    p1 = 1
    p2 = pages+1
    if args.page:
        if args.page.startswith(':'):
            p2 = int(args.page[1:])+1
        elif args.page.endswith(':'):
            p1 = int(args.page[:-1])
        elif ':' in args.page:
            p1 = int(args.page[:args.page.index(':')])
            p2 = int(args.page[args.page.index(':')+1:])+1
        else:
            p1 = int(args.page)
            p2 = p1+1
    # Download page
    for p in range(p1, p2):
        if p != 1:
            page = requests.get(args.url+f'?p={p}')
            soup = BeautifulSoup(page.text, 'html.parser')
        posts = soup.select('div.card.yp-post')
        posts_start = 0
        if args.post and p == p1:
            if args.post.startswith(':'):
                posts = posts[:int(args.post[1:])]
            elif args.post.endswith(':'):
                posts = posts[int(args.post[:-1])-1:]
                posts_start = int(args.post[:-1])-1
            elif ':' in args.post:
                posts = posts[int(args.post[:args.post.index(':')]) - 1:int(args.post[args.post.index(':')+1:])]
                posts_start = int(args.post[:args.post.index(':')]) - 1
            else:
                posts = [posts[int(args.post)-1]]
                posts_start = int(args.post)-1
        info = []
        for i, post in enumerate(posts):
            print(f'[Page {p}/{pages}, Post {posts_start+i+1}/{len(posts)}] ', end='')
            info.append(downloadpost(args, post, patreon_title))
            print('')
        if args.write_info:
            write_info(info, p, pages, args.output_dir, patreon_title)