Bittorrent Notes

Prerequisites

OS: Assuming Fedora [24-26] or Centos 7
HW: x86/64, i5, 16GB memory
NET: 2Gbytes/month

For AWS, this is m4.xlarge or m3.xlarge, with 21 Gbytes in / 100 Gbytes out per week for a 96-torrent, 2-hr scraper (aka got-705). Expect to need perhaps twice this network capacity to saturate the sample.

Cloud dedicated host or virtual server footprint: HP ProLiant DL120, Centos 7, 10TB/month.

The development machine is an Intel NUC Kit: Core i5-6260U Slim or Core i5-7260U (7I5BNK) Slim, 16GB, Fedora 25.

Software Setup

Development baseline is gcc-6.4.1, using C++17, boost-1.60, custom libtorrent from 2017-08-01.

To bootstrap gcc, download the gcc-6-branch from gcc.gnu.org and configure it for C and C++. Make, then install. Note that “-std=gnu++17” is required in CXXFLAGS. Set CC to gcc-6.4, CXX to g++-6.4, and CXXFLAGS to “-g -O2 -std=gnu++17” via exports in .bashrc.

To rebuild boost on Centos 7, download the source RPM on F25, transfer it to the Centos 7 machine, and unpack it. Then:

yum install -y cmake bzip2-devel python-devel python3-devel libicu-devel openmpi-devel mpich-devel

To rebuild boost, work in the rpmbuild directory’s SPECS subdirectory:

rpmbuild -ba boost.spec --without python3 --without mpich LDFLAGS+=--build-id

Then install the generated RPMS.

Install system diagnostic tools for monitoring network load, plus VPN support, nmcli, etc.

dnf install -y nload NetworkManager NetworkManager-openvpn openvpn

GeoIP is installed as follows:

dnf install -y GeoIP GeoIP-devel GeoIP-GeoLite-data GeoIP-GeoLite-data-extra;
dnf install -y libcurl libcurl-devel;
dnf install -y libevent libevent-devel --allowerasing;
dnf install -y intltool;

Test that this is working by running geoiplookup on the command line:

%geoiplookup 8.8.8.8
GeoIP Country Edition: US, United States
GeoIP City Edition, Rev 1: US, CA, California, Mountain View, 94035, 37.386002, -122.083801, 807, 650
GeoIP ASNum Edition: AS15169 Google Inc.
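
For scripted checks, output like the sample above can be split into edition/value pairs. This is a minimal sketch, assuming the "label: value" line layout shown; the helper name parse_geoiplookup is hypothetical and not part of GeoIP.

```python
def parse_geoiplookup(text):
    """Map each 'GeoIP ... Edition' label to its value string."""
    result = {}
    for line in text.splitlines():
        # Split on the first ": " only; values themselves contain commas.
        label, sep, value = line.partition(": ")
        if sep and label.startswith("GeoIP"):
            result[label.strip()] = value.strip()
    return result


# Sample text copied from the geoiplookup output in these notes.
sample = """GeoIP Country Edition: US, United States
GeoIP City Edition, Rev 1: US, CA, California, Mountain View, 94035, 37.386002, -122.083801, 807, 650
GeoIP ASNum Edition: AS15169 Google Inc."""

info = parse_geoiplookup(sample)
print(info["GeoIP Country Edition"])
```

In practice the text would come from running the command (for example via subprocess) rather than from a literal string.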

Other dependencies: rapidjson, rapidxml for visualization interface.

dnf install -y rapidjson rapidjson-devel rapidxml-devel;

SSL and crypto support is assumed:

dnf install -y openssl openssl-libs openssl-devel;

For the bittorrent protocol, use libtorrent built from source.

In addition to the operating system packages, Python support for geolocation is needed, in particular bencode, geoip2, and requests. So:

pip install GeoIP
pip install GeoIP2
pip install bencode
pip install requests
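
A quick way to confirm the pip installs took effect is to check that each module can be found on the import path. This is a sketch only: the import names below (GeoIP, geoip2, bencode, requests) are assumed to match the package names, and missing_modules is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# Assumed import names for the four pip packages listed above.
wanted = ["GeoIP", "geoip2", "bencode", "requests"]
print("missing:", missing_modules(wanted))
```

An empty list means all four are importable in the current Python environment.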

Required System Configuration

Remove rpcbind for datacenter use as per BUND warning. See disable portmapper notes.

Set ulimits to unlimited (minimum 16k), because seeding mode runs into open file limits.
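
The open file limit can also be checked, and raised up to the hard limit, from inside a Python process. This is a minimal sketch using the standard resource module; the 16384 threshold is an assumed reading of "16k", and ensure_nofile is a hypothetical helper. Raising the hard limit itself still needs system configuration.

```python
import resource


def ensure_nofile(minimum=16384):
    """Raise the soft open-file limit to `minimum`, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= minimum:
        return soft
    # An unprivileged process may raise its soft limit only up to the hard limit.
    target = minimum if hard == unlimited else min(minimum, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


soft = ensure_nofile()
print("soft open-file limit:", soft)
```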

Background

Part one of Bittorrent distribution research was Fall 2015. Part two is Fall 2016. Part three is Spring & Summer 2017.

Bittorrent. Common terms and jargon.

The basic idea is as per The BitTorrent Protocol Specification. Of particular note are trackers, and especially the tracker scrape protocol.
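
The scrape side of the tracker protocol can be sketched for both HTTP and UDP trackers. The sketch below assumes the common conventions: an HTTP scrape URL is derived by replacing "announce" after the last slash with "scrape", and UDP packets follow the layout in BEP 15. All function names are hypothetical, and no network traffic is sent.

```python
import struct

PROTOCOL_ID = 0x41727101980  # magic constant for UDP tracker connect requests
ACTION_CONNECT = 0
ACTION_SCRAPE = 2


def scrape_url(announce):
    """Return the HTTP scrape URL for an announce URL, or None if unsupported."""
    slash = announce.rfind("/")
    tail = announce[slash + 1:]
    if slash == -1 or not tail.startswith("announce"):
        return None
    return announce[:slash + 1] + "scrape" + tail[len("announce"):]


def connect_request(transaction_id):
    """16-byte UDP connect request: protocol id, action, transaction id."""
    return struct.pack(">QII", PROTOCOL_ID, ACTION_CONNECT, transaction_id)


def scrape_request(connection_id, transaction_id, info_hashes):
    """UDP scrape request: 16-byte header followed by 20-byte info hashes."""
    for h in info_hashes:
        if len(h) != 20:
            raise ValueError("info hash must be 20 bytes")
    header = struct.pack(">QII", connection_id, ACTION_SCRAPE, transaction_id)
    return header + b"".join(info_hashes)


def parse_scrape_response(data, count):
    """Return (transaction_id, [(seeders, completed, leechers), ...])."""
    action, transaction_id = struct.unpack_from(">II", data, 0)
    if action != ACTION_SCRAPE:
        raise ValueError("unexpected action %d" % action)
    stats = [struct.unpack_from(">III", data, 8 + 12 * i) for i in range(count)]
    return transaction_id, stats


print(scrape_url("http://example.com/announce"))  # http://example.com/scrape
print(scrape_url("http://example.com/a"))         # None
```

A real scraper would send connect_request over a UDP socket, read the connection id from the reply, and then send scrape_request with a fresh random transaction id.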

Magnet links: see the Wikipedia entry. PEX, DHT, and magnet links are all covered by Lifehacker.
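
As a small illustration of what a magnet link carries, the sketch below pulls out the info hash (the xt=urn:btih: parameter), the display name (dn), and any tracker URLs (tr). It uses only the standard library; parse_magnet is a hypothetical helper and the example link is made up.

```python
from urllib.parse import urlsplit, parse_qs


def parse_magnet(uri):
    """Extract info hash, display name and trackers from a magnet URI."""
    parts = urlsplit(uri)
    if parts.scheme != "magnet":
        raise ValueError("not a magnet URI")
    params = parse_qs(parts.query)
    info_hash = None
    for xt in params.get("xt", []):
        if xt.startswith("urn:btih:"):
            info_hash = xt[len("urn:btih:"):]
            break
    return {
        "info_hash": info_hash,
        "name": params.get("dn", [None])[0],
        "trackers": params.get("tr", []),
    }


# Made-up example: a 40-character hex hash, a name, and one UDP tracker.
link = ("magnet:?xt=urn:btih:" + "ab" * 20 +
        "&dn=sample+file&tr=udp%3A%2F%2Ftracker.example.org%3A6969")
print(parse_magnet(link))
```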

Questions

1. Trackers. List of trackers, announcements, UDP, HTTP.
Fetishizing the most current trackers: 1, 2.

2. Public torrents, private torrents, SSL torrents. Why can some of these be scraped, and others not so much? Running transmission-show -s my.torrent returns status for public and private torrents, but not for SSL torrents.

3. Look at transmission dependencies: openssl, libcurl. See libtransmission includes: transmission.h, variant.h, utils.h. See: struct tr_tracker_stat, tr_torrentTrackers, tr_torrentTrackersFree, tr_info, tr_tracker_info, tr_torrent_activity. Of note, it looks like multiple reads of peers from tracker. TR_PEER_FROM_PEX, TR_PEER_FROM_DHT, TR_PEER_FROM_TRACKER.

4. Look at SSL cert example in libtorrent. See libtorrent github source repository.

5. What kind of tree/node visualizations will work? See The Book of Trees.
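
For question 2 above, one way to tell the three kinds of torrent apart is to inspect the info dictionary of the .torrent file. The sketch below is a minimal bencode decoder plus a classifier. It assumes that private torrents carry a "private" key set to 1 (as in BEP 27) and that SSL torrents carry an "ssl-cert" key (as libtorrent appears to do); both assumptions should be checked against real files. The names bdecode and classify are hypothetical.

```python
def bdecode(data, i=0):
    """Decode one bencoded value from bytes; return (value, next_index)."""
    c = data[i:i + 1]
    if c == b"i":                      # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":                      # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":                      # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = bdecode(data, i)
            value, i = bdecode(data, i)
            result[key] = value
        return result, i + 1
    if c.isdigit():                    # byte string: <length>:<bytes>
        colon = data.index(b":", i)
        length = int(data[i:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError("invalid bencode at offset %d" % i)


def classify(torrent_bytes):
    """Return 'ssl', 'private' or 'public' based on assumed info-dict keys."""
    meta, _ = bdecode(torrent_bytes)
    info = meta[b"info"]
    if b"ssl-cert" in info:            # assumed marker for SSL torrents
        return "ssl"
    if info.get(b"private") == 1:      # assumed marker for private torrents
        return "private"
    return "public"


# Hand-built metadata standing in for the contents of a .torrent file.
sample = b"d8:announce27:http://example.com/announce4:infod4:name4:demo7:privatei1eee"
print(classify(sample))
```

For a real file, pass the bytes read from it, for example open("my.torrent", "rb").read().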

Deliverables

One. Data set 1 is four torrents from October 11 for the same recent serial television episode, each from a different uploading group: LOL, DIMENSION, AFG, mSD.

Two. Data set 2 is three torrents from October 17, over 8 locales, during 3 time periods.

Three. Prepare for the TWD Season 7 premiere on 2016-10-23.

  • visualization: three or more approaches, data types, presentation/exhibition, prototype
  • scraper analytics
  • archival data format
  • persistent scraping, archival interface