Bittorrent Notes

Prerequisites

OS: Assuming Fedora [24-26] or Centos 7
HW: x86/64, i5, 16GB memory
NET: 2Gbytes/month

For AWS, this is m4.xlarge or m3.xlarge with 21 in / 100 out Gbytes per week for a 96-torrent 2-hr scraper. (Aka, got-705). Expect to need perhaps twice this network capacity to saturate the sample.
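As a rough check on the figures above, the weekly scraper traffic can be turned into a monthly budget. This is only a back-of-envelope sketch: the 52/12 weeks-per-month factor, the function name, and the headroom parameter are assumptions, not measurements.

```python
# Back-of-envelope monthly bandwidth for the 96-torrent, 2-hr scraper.
# Weekly figures come from the notes above; everything else is an assumption.

WEEKS_PER_MONTH = 52 / 12  # average number of weeks in a calendar month


def monthly_gbytes(in_per_week, out_per_week, headroom=1.0):
    """Return (in, out, total) Gbytes per month, scaled by a headroom factor."""
    scale = WEEKS_PER_MONTH * headroom
    inbound = in_per_week * scale
    outbound = out_per_week * scale
    return inbound, outbound, inbound + outbound


if __name__ == "__main__":
    measured = monthly_gbytes(21, 100)
    doubled = monthly_gbytes(21, 100, headroom=2.0)
    print("measured: %.0f in / %.0f out / %.0f total Gbytes per month" % measured)
    print("doubled:  %.0f in / %.0f out / %.0f total Gbytes per month" % doubled)
```

With the doubling applied, the total comes to roughly 1 TB per month.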

Cloud dedicated host or virtual server footprint: HP ProLiant DL120, Centos 7, 10TB/month.

Development machine is Intel NUC Kit: Core i5-6260U Slim or Core i5-7260U (7I5BNK) Slim, 16GB, Fedora 25.

Software Setup

Development baseline is gcc-6.4.1, using C++17, boost-1.60, custom libtorrent from 2017-08-01.

To bootstrap gcc, download the gcc-6-branch from gcc.gnu.org, and configure for c/c++. Make, install. Note that “-std=gnu++17” is required in CXXFLAGS. Set CC as gcc-6.4, CXX as g++-6.4, and CXXFLAGS as “-g -O2 -std=gnu++17” via exports in .bashrc.

To rebuild boost on Centos 7, download the source RPM on F25, transfer it to the Centos 7 machine, and unpack it. Then:

yum install -y cmake bzip2-devel python-devel python3-devel libicu-devel  openmpi-devel  mpich-devel

To rebuild boost, a little bit of fun in the rpmbuild directory’s SPECS subdirectory:

rpmbuild -ba boost.spec --without python3 --without mpich LDFLAGS+=--build-id

Then install the generated RPMS.

Get some systems diagnostic tools for monitoring network load, VPN support, nmcli, etc.

dnf install -y nload NetworkManager NetworkManager-openvpn openvpn

GeoIP installed as

dnf install -y GeoIP GeoIP-devel GeoIP-GeoLite-data GeoIP-GeoLite-data-extra;
dnf install -y libcurl libcurl-devel;
dnf install -y libevent libevent-devel --allowerasing;
dnf install -y intltool;

Test that this is working by using geoiplookup on the command line

%geoiplookup 8.8.8.8
GeoIP Country Edition: US, United States
GeoIP City Edition, Rev 1: US, CA, California, Mountain View, 94035, 37.386002, -122.083801, 807, 650
GeoIP ASNum Edition: AS15169 Google Inc.
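If the geoiplookup output needs to be consumed by a script, it can be split into fields per database edition. A minimal sketch, assuming the output keeps the "Edition: comma-separated fields" shape shown above; parse_geoiplookup is a hypothetical helper, not part of any GeoIP package.

```python
# Parse geoiplookup-style output into a dict keyed by database edition.
# parse_geoiplookup is a hypothetical helper written for illustration.

def parse_geoiplookup(text):
    """Map each 'Edition: fields' line to a list of comma-separated fields."""
    result = {}
    for line in text.splitlines():
        # Split on the first ": " only; edition names may contain commas.
        edition, sep, fields = line.partition(": ")
        if not sep:
            continue  # skip blank or unexpected lines
        result[edition.strip()] = [f.strip() for f in fields.split(",")]
    return result


SAMPLE = """\
GeoIP Country Edition: US, United States
GeoIP City Edition, Rev 1: US, CA, California, Mountain View, 94035, 37.386002, -122.083801, 807, 650
GeoIP ASNum Edition: AS15169 Google Inc.
"""

if __name__ == "__main__":
    for edition, fields in parse_geoiplookup(SAMPLE).items():
        print(edition, "->", fields)
```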

Other dependencies: rapidjson, rapidxml for visualization interface.

dnf install -y rapidjson rapidjson-devel rapidxml-devel;

Assuming SSL and crypto support.

dnf install -y openssl openssl-libs openssl-devel;

For the bittorrent protocol, use libtorrent built from source.

In addition to the operating system packages, python support for geolocation is needed. In particular, need bencode, geoip2, requests. So:

pip install GeoIP
pip install GeoIP2
pip install bencode
pip install requests

Required System Configuration

Remove rpcbind for datacenter use as per BUND warning. See disable portmapper notes.

Have to set ulimits to unlimited (min 16k) because of open file limits in seeding mode.
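The open-file limit can be checked from a script before a seeding run starts. This is a sketch using Python's standard resource module (Unix only); the 16k threshold comes from the note above, and open_file_limit_ok is a hypothetical helper name.

```python
# Check the open-file limit before starting a seeding run (Unix only).
import resource

MIN_OPEN_FILES = 16 * 1024  # the "min 16k" from the notes


def open_file_limit_ok(minimum=MIN_OPEN_FILES):
    """Return True if the soft RLIMIT_NOFILE is unlimited or >= minimum."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft == resource.RLIM_INFINITY or soft >= minimum


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft=%s hard=%s ok=%s" % (soft, hard, open_file_limit_ok()))
```

A False result means the limit needs raising (for example via ulimit -n in the launching shell) before seeding.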

Background

Part one of Bittorrent distribution research was Fall 2015. Part two is Fall 2016. Part three is Spring & Summer 2017.

Bittorrent. Common terms and jargon.

Basic idea as per The BitTorrent Protocol Specification. Of particular note are trackers, and especially the tracker scrape protocol.
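To make the scrape idea concrete: a tracker answers a scrape request with a bencoded dictionary. The sketch below hand-rolls a small bencode decoder so it stays self-contained (the pip bencode package would normally do this job). It assumes the commonly documented scrape layout, a "files" dictionary keyed by 20-byte info_hash with "complete", "downloaded", and "incomplete" counts; the sample response, its values, and the helper names are made up for illustration.

```python
# Minimal bencode decoder plus a reader for a scrape-style response.
# SAMPLE and its numbers are invented; real trackers may add other keys.

def bdecode(data, pos=0):
    """Decode one bencoded value from bytes; return (value, next_position)."""
    lead = data[pos:pos + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                      # list: l<items>e
        items, pos = [], pos + 1
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                      # dictionary: d<key><value>...e
        table, pos = {}, pos + 1
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            table[key], pos = bdecode(data, pos)
        return table, pos + 1
    if lead.isdigit():                    # byte string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError("invalid bencode at offset %d" % pos)


def scrape_counts(raw):
    """Return {info_hash_hex: {seeders, leechers, downloaded}} from a response."""
    decoded, _ = bdecode(raw)
    counts = {}
    for info_hash, stats in decoded[b"files"].items():
        counts[info_hash.hex()] = {
            "seeders": stats[b"complete"],
            "leechers": stats[b"incomplete"],
            "downloaded": stats[b"downloaded"],
        }
    return counts


INFO_HASH = b"\x12" * 20  # made-up 20-byte info_hash
SAMPLE = (b"d5:filesd20:" + INFO_HASH +
          b"d8:completei5e10:downloadedi50e10:incompletei10eeee")

if __name__ == "__main__":
    print(scrape_counts(SAMPLE))
```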

Magnet links wikipedia entry. PEX, DHT, Magnet links all from lifehacker.

Questions

1. Trackers. List of trackers, announcements, UDP, http.
Fetishizing the most current trackers: 1, 2.

2. Public torrents, Private torrents, SSL torrents. Why are some of these able to be scraped, and others, not so much? From transmission-show -s my.torrent, will get status for public and private, but not ssl torrents.

3. Look at transmission dependencies: openssl, libcurl. See libtransmission includes: transmission.h, variant.h, utils.h. See: struct tr_tracker_stat, tr_torrentTrackers, tr_torrentTrackersFree, tr_info, tr_tracker_info, tr_torrent_activity. Of note, it looks like multiple reads of peers from tracker. TR_PEER_FROM_PEX, TR_PEER_FROM_DHT, TR_PEER_FROM_TRACKER.

4. Look at SSL cert example in libtorrent. See libtorrent github source repository.

5. What kind of tree/node visualizations will work? See The Book of Trees.

Deliverables

One. Data set 1 is four torrents from October 11 for the same recent serial television episode, each a different uploading group: LOL, DIMENSION, AFG, mSD.

Two. Data set 2 is three torrents from October 17, over 8 locales, during 3 time periods.

Three. Prepare for TWD 2016-10-23 Season 7 premiere.

  • visualization: three or more approaches, data types, presentation/exhibition, prototype
  • scraper analytics
  • archival data format
  • persistent scraping, archival interface

libabigail aka C++ Instrumentation and Analysis


Background

Libabigail is shorthand for the alternative, which just so happens to be a bit of a mouthful: “GNU Application Binary Interface Generic Analysis and Instrumentation Library.”

This is a current compiler/language research topic to provide a serialized XML form of C++11 sources as compiled by GNU g++, and a way of looking at the data produced. This data can be parsed to more accurately determine ABI compatibility, to better understand code additions and changes and how these change the exported interface, to examine and prototype how C++11 language usage determines linkage, etc.

Discussions about this functionality started at the “C++ ABI BOF” at the GNU Tools Cauldron 2012 Prague. This work was created at Red Hat, by Benjamin Kosnik, Jason Merrill, and Dodji Seketeli. Some updates at 2013 Cauldron. See “Cauldron 2013 GCC ABI BOF.”

Development sources are written in mixed C++2003/C++11, hosted in git, based on GCC trunk, and tracking what will be gcc-4.9.0. The branch is administered by Dodji Seketeli.

Please feel free to try it out, but know that the state is experimental and quite raw.

Feedback and assistance are welcome.

Starting from a git working tree as described in GitMirror, add the libabigail repository as follows:

git checkout -b libabigail origin/libabigail

To stay up to date, use:

git pull


Overview

How is this expected to be used? First, a libabigail top-level directory is either added to the GCC sources or compiled as a first step and put into some PREFIX directory. The GNU C++ compiler, g++, is configured to use this new library with:

configure .. --with-abigail=$PREFIX

Thus configured, the C++ front end is built, installed, and used as the primary compiler. All sources are compiled with an additional flag, -fdump-abi.

So, this command:

g++ -c -fdump-abi somefile.cc

Creates two files:

  • somefile.o

    The object file

  • somefile.cc.bi

    The XML instrumentation file


API/ABI

basics

Toplevel namespace is abigail.

The interface header files in libabigail:

abg-ir.h
abg-corpus.h

Doxygen is used to document the sources: try make html to generate, and look in libabigail-build-dir/doc/api/html/index.html to read it.

And then the binary interface is in libabigail.so.

notes

Each object file is compiled to a translation_unit. The sum of all translation_units is a corpus.

Compiler-generated files are read as serialized input to a translation_unit and de-serialized. And any modified form is written to an output file in serialized form.

The interface to the C++ intermediate representation is best viewed in the class documentation.

Opinions and Wild Guesses

1. Some formatting tips.

– classes “read” as types, data, member functions. In that order.

– doxygen gives feedback on the state of the doxygen parse in the form of a log, as you run “make html.” Read this log: doxygen is a fuzzy parse. There are formatting things you can do to make it better. Do them. It’s easier to fix up these errors than figure out why the generated HTML is poor.

2. Use of shared_ptr is intriguing.

There are not really a lot of existing usage patterns for std::shared_ptr in libstdc++ (in C++11). If you look at the page of boost idioms for shared_ptr usage:

http://www.boost.org/doc/libs/1_54_0/libs/smart_ptr/sp_techniques.html

One notices that there’s not a lot of use of shared_ptrs in interfaces. Yet in libabigail, that is very common. I’m curious about this style question.

And most usage is up for debate, see this Stack Overflow discussion about using shared_ptrs as function arguments. Should the parameters be const reference or just shared_ptr? And another.

Some interesting thinking from Microsoft on shared_ptr usage.

3. Use of virtual binary operators is odd.

The old adage is that operators cause havoc in overload resolution. These are binary operators, but the stigma lingers. A vague feeling is not the same as something definite that’s a hard no. It’s more like the pirate code than a strict coding convention or hard rule. I would say that if you ever start to see strange bugs due to overloading, consider making these (non-operator) functions.

Otherwise, do it.

DOT Notes

DOT is a graph description language. It’s the language used to drive around the Graphviz tools.

Some basics about the DOT language grammar.

Doxygen vs. generated graphviz class hierarchy visualizations

Here is a breakdown of the generation steps doxygen uses to visualize class hierarchy. Requisite software includes doxygen-1.8.3 on C++/C++11 files, compilation and development environment is Fedora 18/x86_64 using GNU C++ version 4.8.

Doxygen Overview

Some Doxygen basics, and internals. The Fedora package is doxygen-1.8.3-3.fc18.x86_64, the command line invocation is: doxygen, which is a C++ binary.

To make the doxygen binary debuggable, check out doxygen in subversion and configure the build with --debug. (On Fedora, some other dependencies are required, like qt-devel. An alias between what the Makefiles expect, ie code and the installed qmake-qt4 needs to be defined).

For this investigation, the subject of most interest is the language parser for C++, breaking in parseInput(). The doxygen parse phase lowdown:

The task of the parser is to convert the input buffer into a tree of entries (basically an abstract syntax tree). An entry is defined in src/entry.h and is a blob of loosely structured information. The most important field is section which specifies the kind of information contained in the entry.

The other area of interest is the output generator for graphviz sources and then generated diagrams. So, breaking in function generateOutput() (see src/doxygen.cpp), step until

  if (Config_getBool("HAVE_DOT"))
  {
    DotManager::instance()->run();
  }

This is the part that generates the graphviz source files and then uses dot to create output from the previously-parsed C++ source data. Breaking in function DotManager::run() (see src/dot.cpp) allows stepping through individual graph creation.

To be determined: file name, class name mapping to Doxygen internals.

Graphviz Overview

Some graphviz basics, DOT language reference and users guides, wiki. Of particular interest are the “Node, Edge, and Graph Attributes.”

Usual command-line invocation looks like:

dot -v -Tsvg:cairo -o myfile.svg myfile.gv

And then fonts are in ~/.fonts or /usr/share/fonts, and can be controlled via the following attributes:

fontpath="/usr/share/fonts/dejavu"
fontname="DejaVuSansMono"

These should map to installed fonts, ie

%fc-match "DejaVuSansMono"
DejaVuSansMono.ttf: "DejaVu Sans Mono" "Book"

Doxygen Settings

Parts of the doxygen configuration file that matter, the config settings used, and any commentary.

HAVE_DOT=YES

CLASS_GRAPH=YES

UML_LOOK=NO

COLLABORATION_GRAPH=NO (Interesting on a per-class basis only. For larger projects the noise becomes overwhelming.)

CALL_GRAPH=NO (Same.)

CALLER_GRAPH=NO (Same.)

INCLUDE_GRAPH=NO (Same.)

INCLUDED_BY_GRAPH=NO (Same.)

TEMPLATE_RELATIONS=NO (Relations between primary templates and template instances are very cluttered, noise value high. Template relationships and class hierarchy relations in non-UML mode are displayed on the same diagram, but use a different visual grammar. Classes inherit base to derived. Throw templates in and they read "as if" from base to primary template to specific instance. This should instead be base to specific instance.)

DOT_GRAPH_MAX_NODES=50

MAX_DOT_GRAPH_DEPTH=0

DOT_IMAGE_FORMAT=SVG (Resolution-independent text, editable, lossless)

INTERACTIVE_SVG=YES (Focus control for big diagrams)

With these settings, a PDF file of the GNU C++11 API runs over three thousand five hundred pages.

Doxygen XML attribute for Graphviz

Legend for doxygen-generated graphviz diagrams.

2) what attributes are needed in XML to represent this?
3) what are the added attributes/markup needed to get longstanding bugs fixed? Or are these solely parse errors?

Generated Diagram Quality

Sample set is GCC-4.8.0 C++ docset, based on a generated output on 2013-03-10.

Representative diagrams for more traditionally-styled C++ code, in the form of OO-style class hierarchies, are found based off of the std::exception and std::ios_base root elements.

  1. std::exception, just the class hierarchy diagram, and the Exceptions Module
  2. std::ios_base, just the class hierarchy diagram, and the IO Module

Starting with the exception diagram, because the lack of templates in this hierarchy is a useful simplifying factor. This diagram is accurate. Layout issues include: names overflowing the bounding boxes (__gnu_pbds::insert_error) and line break issues (same, but others like __gnu_cxx::recursive_init_error). Many of the line connectors and paths to endpoints are infuriatingly erratic. These issues largely look to be the kind of thing that could be tweaked via various dot settings, or related graphviz tool settings (like neato).

Next diagram: io. This hierarchy diagram is largely accurate, but with distracting elements and extraneous information. This is a multi-level class hierarchy with both base classes and base class templates. Starting from the left, reading to the right. The least-derived base is ios_base. A class template for basic_ios derives from it, taking two parameters. In this diagram, two templates derive from ios_base: the primary class template for basic_ios, and a fully-specialized class template for basic_ios instantiated with the integer type char. A couple of things to note, one level in to the diagram:

  • restricting to just primary templates would be useful, ie this diagram without any specializations. Indeed, this is what the diagrams evolve into once level two and above diagrams are cleaved off, ie starting from basic_istream (instead of ios_base) and going to basic_stringstream.
  • there are actually two specializations for this hierarchy, both char and wchar_t. Where’s wchar_t?
  • there are actually typedefs like ios and wios for the char and wchar_t instantiations of the basic_ios class template.
  • basic_ostream char specialization is duplicated, once for basic_ostream<char>, and once for basic_ostream<char, char_traits, char>. Neither of these instances actually exists. There’s a similar phantom template instance for basic_iostream.

Let’s stop here with this diagram. The rest of the hierarchy has similar issues.

Next, let’s examine some template-heavy components and idioms, like policy-based design.
Components that use this idiom are found in the class hierarchies for std::allocator, std::unordered_map (and the other unordered containers), __gnu_cxx::vstring, and the policy-based data structures extension, such as __gnu_pbds::trie.

  1. std::allocator
  2. __gnu_cxx::__versa_string
  3. __gnu_pbds::trie

For the first class, std::allocator, the generated diagram is accurate. Ideally, there would be a visual marker for the allocator void specialization, and note about the grouping of the superset of extension allocators as the base class for std::allocator.

The two extension classes share common issues, and neither has an accurately-generated class hierarchy. Both make use of multiple base classes and policy-based design.

Finally, a pass at some C++11 features. Of note are things like variadic templates (invented for std::tuple and then used elsewhere) and template aliases (used in many parts of the library with policy-based designs, like std::allocator and std::unordered_map.)

  1. std::tuple
  2. std::unordered_map

Just making a quick pass here. From the tuple diagram, ponder the implied template relations. This is hard, since making sense of this with a visual grammar would require better grouping between primary template, partial specializations, and full specializations.

And for unordered_map, apparently the complexity of the derivation, plus the templates, plus the use of C++11 features like alias templates aborts the graph. No hierarchy is not an accurate hierarchy.

Doxygen use is not considered harmful: even with these flaws it is an invaluable tool. Reasonable people may differ, of course.

Sources (C++11, graphviz)

For a given set of sources:

struct base
{
  enum mode : short { in, out, top, bottom };
  typedef long value_type;
};

struct A : public base
{
  int _M_i;
  int _M_n;
};

struct B : public base
{
  value_type _M_v;

  constexpr B(value_type __v = 6) : _M_v(__v) { }
};

struct C : public B
{
  constexpr value_type
  square() { return _M_v * _M_v; }
};

struct D : public A
{
  D(const D& __d) : A() { }

  ~D() { }
};

Next, use doxygen to generate HTML, with HAVE_DOT set to YES and DOT_CLEANUP set to NO in the doxygen configuration file. With this configuration tweak, when doxygen is used to generate HTML, the doxygen-generated graphviz sources used to create the class diagrams are not destroyed. On examination, they produce the following graphic:

[Figure: inheritance graph for struct base (structbase__inherit__graph)]

And then, look at the generated graphviz for the base class, the root of the diagram:

digraph "base"
{
  // edge and node defaults
  edge [fontname="FreeSans",fontsize="9",labelfontname="FreeSans",labelfontsize="9"];
  node [fontname="FreeSans",fontsize="9",shape=record];

  // actual graph
  Node1 [label="base",height=0.2,width=0.4,color="black", fillcolor="grey75", style="filled" fontcolor="black"];
  Node1 -> Node2 [dir="back",color="midnightblue",fontsize="9",style="solid",fontname="FreeSans"];
  Node2 [label="A",height=0.2,width=0.4,color="black", fillcolor="white", style="filled",URL="$struct_a.html"];
  Node2 -> Node3 [dir="back",color="midnightblue",fontsize="9",style="solid",fontname="FreeSans"];
  Node3 [label="D",height=0.2,width=0.4,color="black", fillcolor="white", style="filled",URL="$struct_d.html"];
  Node1 -> Node4 [dir="back",color="midnightblue",fontsize="9",style="solid",fontname="FreeSans"];
  Node4 [label="B",height=0.2,width=0.4,color="black", fillcolor="white", style="filled",URL="$struct_b.html"];
  Node4 -> Node5 [dir="back",color="midnightblue",fontsize="9",style="solid",fontname="FreeSans"];
  Node5 [label="C",height=0.2,width=0.4,color="black", fillcolor="white", style="filled",URL="$struct_c.html"];
}
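For comparison, the same inheritance graph can be produced by hand from a simple base-to-derived table. This is a sketch only: HIERARCHY mirrors the structs above, to_dot is a hypothetical helper, and the output is a simplification of what doxygen writes (real class names as node ids, no URLs or fill colors).

```python
# Emit a DOT inheritance graph from a base -> derived mapping.
# to_dot is a hypothetical helper; attributes are trimmed from doxygen's output.

HIERARCHY = {
    "base": ["A", "B"],
    "A": ["D"],
    "B": ["C"],
}


def to_dot(root, hierarchy):
    """Return DOT source with one back-pointing edge per base/derived pair."""
    lines = ['digraph "%s"' % root, "{",
             '  node [fontname="FreeSans",fontsize="9",shape=record];']
    stack = [root]
    while stack:
        base = stack.pop()
        for derived in hierarchy.get(base, []):
            # dir="back" puts the arrowhead at the base, as doxygen does.
            lines.append('  "%s" -> "%s" [dir="back",color="midnightblue"];'
                         % (base, derived))
            stack.append(derived)
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_dot("base", HIERARCHY))
```

Saving the printed text to a .gv file and running dot -Tsvg on it should render a graph with the same shape as the doxygen one.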

 

Facebook + Art

Facebook in art practice.

Like it or Unfriend it, New York Times, July 3, 2011

Teen Age, You Just Don’t Understand, Catharine Clark Gallery, SF, June 20, 2010. Site archive.

Mutual Friends Visualizations. Visualizing things like political affiliations + geo tags.

Facebook analytics

webtrends, per-post analytics for pages, reading facebook analytics, analytic arm race

To get data, use Account->Settings->Export, or use something like Give Me My Data.

To get some idea of your own behavior, this stalking thing.

Facebook aliases.

Privacy, sharing. Use of aliases on Facebook, couple dog aliases, activist dog aliases.

See Zhao Jing Controversy, Zuckerberg Puppy Page, Tulip, Asta, others. EFF position on pseudonymity on-line and the nymwars. Super-cogent break-down of who is harmed by a real-names policy.

Not about facebook per se, but instead about web analytics, privacy, tracking: collusion.

Europe vs. Facebook, and their report.

Figurative Art Censorship vs. Facebook.

A good perspective on where user-generated content goes as social media sites die.

An article in the gray lady comparing those who don’t have facebook in the year 2011 to those who grew up in the 80s and 90s without television. Since I was Amish about television before I met my lovely wife, this comparison has great appeal.

Using R to visualize the friend graph, and the first post about crawling facebook data with R. Along the way, I discover Rcpp, which is a C++ mapping to R and looks very interesting.

An example data trail, with just a part of the facebook trail.

More examples of Facebook and visual art, performance art, etc.

Some of the initial response to Facebook’s graph search, including the display of generated graphs, made some interesting statements.

Using Gephi to visualize Facebook connections.

PLOS ONE: Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. Gender, word choice, visualizations.

Looking at APIs of social media companies, including Facebook. What you Look like to a Social Network. Also, the onavo app, acquired by Facebook.

When You Fall in Love, This Is What Facebook Sees, The Atlantic, 2014-02-14