HITLIST Generation Software Release
-----------------------------------

WARNING: this software is tied-in with USC/ISI LANDER project infrastructure.
In order to use it, you will need to replicate this LANDER setup.  In 
particular you'll need:

1. Hadoop map/reduce framework and HDFS running and environment
   variables HADOOP_HOME set appropriately.

2. Census datasets need to be stored on NFS made available to all
   hadoop nodes.

3. Special ICMPTrainRecordReader java code needs to be made available
   on the Hadoop cluster for reading binary format of LANDER census
   binary data.  This code is documented and available for download
   here:
   http://www.isi.edu/ant/traces/topology/address_surveys/binformat_description.html

4. Note: this code has been updated to produce history datasets.

5. This code requires the Fsdb perl library, see
http://www.isi.edu/~johnh/SOFTWARE/FSDB and/or CPAN.

Format of the hitlist output is descrbed at
http://www.isi.edu/ant/traces/topology/ip_hitlists/format_description.html

SETUP
=====

* Unpack this tar to an NFS-accessible file system (all Hadoop nodes
  need access to scripts).

* Edit the top-level configuration file `hitlist_config'.  In
  particular:
  - HITLIST_DEST_DIR is the destination directory for the newly
    generated hitlists;

  - HITLIST_INTERMEDIATE_DIR is the directory where intermediate
    data will be accumulated (intermidiate data resulted from
    generation of a hitlist will be used for the generation of
    future hitlists;

  - HITLIST_CENSUS_DIR is the directory where census data is stored

All of these must be in NFS, not HDFS.  All most be readable from all
Hadoop nodes (for example, by a shared filesystem).


RUN SCRIPTS
===========

* cd to the working directory and run:

     ./hitlist_do_all.sh <census_dir>

   where <census_dir> is the census dataset distribution directory
   in the standard file system.  For
   example:

     ./hitlist_do_all.sh $HITLIST_CENSUS_DIR/internet_address_census_it37w-20101124

* If you do it the first time, you need to accumulate enough 
  intermediate output for hitlist algorithm to work.  At 
  least 16 bits of history are needed to calculate hitlist score,
  thus you may need to run this at least 16 times before
  on censuses e.g. it31w .. it46w, before a hitlist-dataset
  for it46 will be generated.

  (Note: at ISI we have already bootstrapped, and we have not tested
  the bootstraping code recently.  If you have trouble with it, please
  contact us.)

  [xxx: to do 2013-05-21: test bootstraping code and provide sample
  input and output.]

* When the script completes, the output will be in
  $HITLIST_CENSUS_DIR (also should contain history
  datasets).
  
  Intermediate results needed for the next run will be in
   $HITLIST_INTERMEDIATE_DIR



BRIEF DESCRIPTION OF ALGORITHM and SCRIPTS
==========================================

* The first thing is to get the latest IANA allocation and prepare it
  in a format that would be used by further map/reduce work.
     
  generate_base_ipv4_hitlist.pl
  => Do the preparation
  
* Then, hitlist generation has four major steps
  with 7 map/reduce jobs.
  
** STEP 1: Prepare intermediate file

 1> census_to_intermediate_(map, red) 

    => get intermediate file for the latest census, call it
    latest_single_intermediate 

    (This latest_single_intermediate contains only responsive IP
    addresses in the latest census.)

    The intermediate file output is of the format:
    	#fsdb -F t ip history   
	0100041d        3ff
	01000534        7ff
	01001011        f8
	...
    which is a hex-encoded IP address and a bitmask for hits history.
     
 2> intermediate_join_(map, red)

    => join the latest_single_intermediate with old intermediate file,
       and get the new intermediate file, call it latest_intermediate

    (This latest_intermediate contain all IP addresses that either was
      responsive in previous censuses or is responsive in latest
      census.  The responsiveness record is attached with each IP
      address in a bitmap format.)

** STEP 2: Get latest raw hitlist

 3> hitlists_(map, red)

    => choose one IP address out of each /24 as representative,
       Representatives are thought to have best responsive records.
       Output is latest_raw_hitlist

** STEP 3: Compare with last hitlist to apply stability

 4> old_hitlists_update_with_intermediate_(map, red)

    => Update last_stable_hitlist with latest_intermeidate,
       output is updated_last_stable_hitlist

       (Renew the responsiveness records of representatives in last_hitlist)

 5> stable_hitlists_(map, red)

    => Compare latest_raw_hitlist with updated_last_stable_hitlist
       to get the latest_stable_hitlist 

       (Not all representatives in last_stable_hitlist are replaced by
        latest_raw_hitlist, a threshold is introduced to achieve
        stability)

 6> old_hitlists_transfer_map
    => Just a format preparation for next update

** STEP 4: Merge with latest IANA allocation to apply completeness

 7> stable_hitlists_map & stable_hitlists_with_IANA_red

    => merge the latest_stable_hitlist with latest IANA 
       allocation to get the latest_hitlist, which is also
       the release version.
       
    Several format and naming process is also applied before final
    release.

Near-final output format of a final hitlist is:

       #fsdb -F t prefix hex_ip score ip       
       01009600        0100963d        76      1.0.150.61
       0100ae00        0100aef5        67      1.0.174.245
       0100e700        0100e790        76      1.0.231.144
       0100ff00        0100ffff        75      1.0.255.255
       ..

The hex-encoded /24 prefix of the block, the hex-encoded IP address of
the target, the score, and the 4-octet equivalent of hex_ip.

Final output format strips off the first column.



    
* For further information about this work, please refer to our published
  paper: 
  
Xun Fan and John Heidemann. Selecting Representative IP Addresses for Internet Topology Studies. 
 In Proceedings of the ACM Internet Measurement Conference. Melbourne, Australia, ACM. November, 2010. 
 <http://www.isi.edu/~johnh/PAPERS/Fan10a.html>.
