NearNamedPlace the MEL way

From Hiscom
Jump to: navigation, search


The current Texpress implementation of MELISR does not have a gazetteer field. Hence, MEL does not presently record generalised locality information and has so far been unable to deliver this information to AVH.

The following script obtains the nearest named place for all geocoded records in MELISR.

In short, the script retrieves geocoded records from MELISR (function getMelisrRecords), then passes the records one by one to the reverse geocode functions, findNearestAu for Australian records and findNearestForeign for foreign ones. The query results returned by these functions are then inserted into the abcd_nearnamedplace table, or, if a record with the same MEL accession number already exists, the existing record is updated (function insertNearest).

The reverse geocoding functions query for places that are within one degree in each direction from the latitude and longitude recorded in the MELISR record, calculate the distance between the MELISR record and all these places (as well as the bearing) and orders the results by this distance. The results are then limited to the first row, so only the place that is closest to the collecting locality is returned. In order to avoid the more obscure place names, only those places categorised as a populated place are included in the query.

The script runs from the command line. There is a compulsory argument (see usage) to set the start date. A second optional argument tells the script the maximum number of records to select from the collections database. This probably does not have a practical value anymore, but was very handy for testing the script. It takes just under a minute to reverse geocode a thousand records. It took about 11 hours to do all geocoded records in MELISR (appr. 680,000).

Note [2012-05-06]: we are still using this script after the migration of our database to Specify. Only the SQL in the getMelisrRecords function has changed.

<?php #reversegeo.php --- Niels Klazenga, 2009-07-27

/*
| usage: php reversegeo.php daily|weekly|monthly|all|yyyy-mm-dd [limit]
|
| Script uses 1995 AUSLIG gazetteer for places in continental Australia;
| the GeoNames geographical database (http://www.geonames.org) is used for places
| outside Australia and Australia's external territories.
|
| Geodetic formulae adapted (as in translated into sql) from 
| http://www.movable-type.co.uk/scripts/latlong.html
|
*/
  // PHP5.3 will give annoying error messages if you leave this out
  date_default_timezone_set('Australia/Melbourne'); 
 
  function connectToDb() {
    @$db = new mysqli('<host>', '<username>', '<password>', '<database>'); // <database> is the collections database
    if (mysqli_connect_errno()) {
      echo 'Error: Could not connect to database' . "\n";
      exit;
    }
    return $db;
  }
 
  function findNearestAu($db, $lat, $lng) {
    // find nearest place (POPL) in AUSLIG gazetteer (gazetteer.gaz)
    // equations from http://www.movable-type.co.uk/scripts/latlong.html
    // (6731 km is the radius of the earth, or rather of a globe with the volume of the earth;
    // turns out the earth is flat after all)
    $sql = $db->query("SELECT 'AUSLIG' as gazetteer, locality AS name, id AS geonameid, 
	    code AS feature_code, state, latitude, longitude,
        (6371 * acos(cos(radians($lat)) * cos(radians(latitude)) *
        cos(radians(longitude) - radians($lng)) +
        sin(radians($lat)) * sin(radians(latitude)))) AS distance,
      MOD((DEGREES(ATAN2(SIN(RADIANS($lng-longitude)) * COS(RADIANS($lat)), 
	    COS(RADIANS(latitude)) * SIN(RADIANS($lat)) -
        SIN(RADIANS(latitude))*COS(RADIANS($lat))*COS(RADIANS($lng-longitude))))+360), 360) AS bearing
      FROM gazetteer.gaz
      WHERE code = 'POPL'
        AND latitude BETWEEN ($lat - 1) AND ($lat + 1)
        AND longitude BETWEEN ($lng - 1) AND ($lng + 1)
      ORDER BY distance ASC
      LIMIT 1");
 
    $row = ($sql->num_rows > 0) ? $sql->fetch_assoc() : false;
    $sql->free();
    return $row;
  }
 
  function findNearestForeign($db, $lat, $lng) {
    // find nearest named place in Geonames database (http://www.geonames.org) for places
    // outside Australian continent
    $sql = $db->query("SELECT 'Geonames' as gazetteer, name, geonameid, feature_code, 
	    country_code AS state, latitude, longitude,
        (6371 * acos(cos(radians($lat)) * cos(radians(latitude)) *
        cos(radians(longitude) - radians($lng)) +
        sin(radians($lat)) * sin(radians(latitude)))) AS distance,
      MOD((DEGREES(ATAN2(SIN(RADIANS($lng-longitude)) * COS(RADIANS($lat)), COS(RADIANS(latitude)) * SIN(RADIANS($lat)) -
        SIN(RADIANS(latitude))*COS(RADIANS($lat))*COS(RADIANS($lng-longitude))))+360), 360) AS bearing
      FROM geonames.gazetteer
      WHERE feature_code = 'PPL'
        AND latitude BETWEEN ($lat - 1) AND ($lat + 1)
        AND longitude BETWEEN ($lng - 1) AND ($lng + 1)
      ORDER BY distance ASC
      LIMIT 1;");
 
    $row = ($sql->num_rows > 0) ? $sql->fetch_assoc() : false;
    $sql->free();
    return $row;
  }
 
  function createNnpTable($db) {
    // CREATE statement for abcd_nearnamedplace table
    $drop = $db->query('DROP TABLE IF EXISTS abcd_nearnamedplace');
    $create = $db->query("CREATE TABLE  abcd_nearnamedplace (
        unitId varchar(10) NOT NULL,
        nnp varchar(200) DEFAULT NULL,
        nnp_state varchar(5) DEFAULT NULL,
        nnp_source varchar(20) DEFAULT NULL,
        nnp_gazetteer varchar(10) DEFAULT NULL,
        nnp_id int(10) unsigned DEFAULT NULL,
        nnp_feature_code varchar(10) DEFAULT NULL,
        nnp_latitude float DEFAULT NULL,
        nnp_longitude float DEFAULT NULL,
        nnp_offset float DEFAULT NULL,
        nnp_bearing float DEFAULT NULL,
        npp_RelationTo varchar(15) DEFAULT NULL,
        date_created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
      ) ENGINE=MyISAM DEFAULT CHARSET=utf8;");
 
    if(!$create) {
      echo 'Error: Could not create table!';
      exit;
    }
  }
 
  function translateBearing($bearing) {
    // converts bearing into more human comprehensible language
    $ret = '';
    switch ($bearing) {
        case($bearing > 348.75 || $bearing <= 11.25):
            $ret = 'N';
        break;
        case($bearing <= 33.75):
            $ret = 'NNE';
        break;
        case($bearing <= 56.25):
            $ret = 'NE';
        break;
        case($bearing <= 78.75):
            $ret = 'ENE';
        break;
        case($bearing <= 101.25):
            $ret = 'E';
        break;
        case($bearing <= 123.75):
            $ret = 'ESE';
        break;
        case($bearing <= 146.25):
            $ret = 'SE';
        break;
        case($bearing <= 168.75):
            $ret = 'SSE';
        break;
        case($bearing <= 191.25):
            $ret = 'S';
        break;
        case($bearing <= 213.75):
            $ret = 'SSW';
        break;
        case($bearing <= 236.25):
            $ret = 'SW';
        break;
        case($bearing <= 258.75):
            $ret = 'WSW';
        break;
        case($bearing <= 281.25):
            $ret = 'W';
        break;
        case($bearing <= 303.75):
            $ret = 'WNW';
        break;
        case($bearing <= 326.25):
            $ret = 'NW';
        break;
        case($bearing <= 348.75):
            $ret = 'NNW';
        break;
    }  
    return $ret;
  }
 
  function insertNearest($db, $id, $nearest) {
    // inserts record into abcd_nearestnamedplace,
    // or updates existing record
 
    $insertdata = array('unitId' => $id,
      'nnp' => ucwords(strtolower($nearest['name'])),
      'nnp_state' => $nearest['state'],
      'nnp_source' => 'reverse geocoding',
      'nnp_gazetteer' => $nearest['gazetteer'],
      'nnp_id' => $nearest['geonameid'],
      'nnp_feature_code' => $nearest['feature_code'],
      'nnp_latitude' => sprintf('%.2f',$nearest['latitude']),
      'nnp_longitude' => sprintf('%.2f',$nearest['longitude']),
      'nnp_offset' => sprintf('%.1f',$nearest['distance']),
      'nnp_bearing' => sprintf('%.0f',$nearest['bearing']),
      'npp_RelationTo' => ($nearest['distance'] >= 0.5) ? round($nearest['distance']) . ' km ' 
	    . translateBearing($nearest['bearing']) : null
    );
 
    $columns = implode(', ', array_keys($insertdata));
 
    $values = array();
    foreach(array_values($insertdata) as $val) 
	  $values[] = is_numeric($val) ? $val : "'" . mysqli_real_escape_string($db, $val) . "'";
    $values = implode(', ', $values);
 
 
    $query = "INSERT INTO abcd_nearnamedplace ($columns) VALUES ($values)";
    $result = $db->query($query);
    $affected = $db->affected_rows;
 
    if($affected == 0) {
      $utime = date("Y-m-d H:i:s");
      $insertdata[] = array('date_created'=>$utime);
 
      $query = 'UPDATE abcd_nearnamedplace SET ';
      foreach($insertdata as $key=>$value) {
        $query .= $key . '=' . (is_numeric($value)) ? $value : "'" . mysqli_real_escape_string($db, $value) . "'";
      }
      $result = $db->query($query);
      $affected = $db->affected_rows;
    }
 
 
    return ($affected != 0) ? $affected : $insertdata;
  }
 
  function createIndexes($db) {
    // creates indexes for abcd_nearnamedplace table
    echo 'Creating indexes........' . "\n";
    $db->query("ALTER TABLE abcd_nearnamedplace ADD PRIMARY KEY (unitId),
       ADD INDEX nnp(nnp),
       ADD INDEX nnp_gazetteer(nnp_gazetteer),
       ADD INDEX nnp_feature_code(nnp_feature_code)");
  }
 
  function getMelisrRecords($db, $terms=false, $limit=false) {
    // turn term array into WHERE clause
    $where = array();
    foreach($terms as $term) {
      $operator = (isset($term['operator'])) ? $term['operator'] : '=';
      $field = $term['field'];
      $value = (is_numeric($term['value'])) ? $term['value'] : "'$term[value]'";
      $where[] = $field . ' ' . $operator . ' ' . $value;
    }
    $where = implode(' AND ', $where);
 
    $select = 'SELECT CONCAT(HERBKEY_2, HERBKEY_3) AS id, declat, declong, country
      FROM specimens
      WHERE declat Is Not Null AND declong Is Not Null AND (declat != 0 OR declong !=0)
        AND ' . $where;
 
    if($limit) $select .= 'LIMIT ' . $limit;
 
    // get records from melisr for which we want to find the nearest named place
    $sql = $db->query($select);
 
    return $sql;
  }
 
  function parseArguments($argv) {
    // parse arguments from command line; $argv[0] is the name of this script
    if(eregi('^[0-9]{4}-[0-9]{2}-[0-9]{2}$', $argv[1])) $datefrom = $argv[1];
    else {
      $timestamp=time();
      switch($argv[1]) {
        case 'daily': 
          $datefrom = date("Y-m-d", $timestamp-(24*60*60));
          break;
        case 'weekly': $datefrom = date("Y-m-d", $timestamp-(7*24*60*60));
          break;
        case 'monthly': $datefrom = date("Y-m-d", $timestamp-(31*24*60*60));
          break;
        case 'all': $datefrom = '0000-00-00';
          break;
        default:
          echo 'usage: php reversegeo.php daily | weekly | monthly';
          exit;
      }
    }
 
    $limit = (isset($argv[2]) && is_numeric($argv[2])) ? $argv[2] : false;
    return array($datefrom, $limit);
  }
 
  function selectMelisrRecords($datefrom) {
    // create array with query terms
    $where = array();
    $where[] = array('field'=>'dateedit', 'value'=>$datefrom, 'operator'=>'>');
    return $where;
  }
 
  function runReverse($argv) {
    // declare log array (will be written to log file)
    $log = array();
 
    // parse arguments given on the command line
    list($datefrom, $limit) = parseArguments($argv);
 
    // get start times and start logging to screen (and file)
    $start = time();
    echo 'script started ' . date("Y-m-d H:i:s") . "\n";
    $log[] = 'script started ' . date("Y-m-d H:i:s");
    echo 'Reverse geocode records edited after: ' . $datefrom . "\n";
    $log[] = 'Reverse geocode records edited after: ' . $datefrom;
 
    // connect to database
    $db = connectToDb();
 
    // if we do a full upload, re-create the abcd_nearnamedplace table
    if($datefrom == '0000-00-00') createNnpTable($db);
 
    // pass argument array to function to create array with query terms
    $terms = selectMelisrRecords($datefrom);
 
    // if the argument equals 'all', we only want continental Australian records,
    // so we push some extra terms into the array 
    /*if($argv[1] == 'all') {
      $terms[] = array('field'=>'country', 'value'=>'Australia');
      $terms[] = array('field'=>'declat', 'value'=>'-45', 'operator'=>'>');
      $terms[] = array('field'=>'declat', 'value'=>'-10', 'operator'=>'<');
      $terms[] = array('field'=>'declong', 'value'=>'112', 'operator'=>'>');
      $terms[] = array('field'=>'declong', 'value'=>'155', 'operator'=>'<');
    }
    */
 
    // get records from melisr for which we want to find the nearest named place
    $sql = getMelisrRecords($db, $terms, $limit);
 
    $numrecs = $sql->num_rows;
 
    if ($numrecs > 0) {
      echo $numrecs . ' records selected for reverse geocoding' . "\n";
      $log[] = $numrecs . ' records selected for reverse geocoding';
      $n = 0; $ins = 0; $upd = 0;
 
      // declare array for records that can not be inserted
      // these will be written to a file at the end of the function
      $inserterrors = array();
 
      if($datefrom != '0000-00-00') $mod = 100;
      else {
        switch($limit) {
          case ($limit < 101): 
            $mod = 10;
            break;
          case ($limit < 201): 
            $mod = 20;
            break;
          case ($limit < 501): 
            $mod = 50;
            break;
          case ($limit < 1001): 
            $mod = 100;
            break;
          case ($limit < 2001): 
            $mod = 200;
            break;
          case ($limit < 5001): 
            $mod = 500;
            break;
          default: 
            $mod = 1000;
        }
      }
 
      for($i=0; $i<$numrecs; $i++) {
        $melrecord = $sql->fetch_assoc();
        $id = $melrecord['id'];
        $lat = $melrecord['declat'];
        $lng = $melrecord['declong'];
 
        if($melrecord['country'] == 'AUSTRALIA' 
          && $melrecord['declat'] > -45 && $melrecord['declat'] < -10 
          && $melrecord['declong'] > 112 && $melrecord['declong'] < 155)
          $nearest = findNearestAu($db, $lat, $lng);
        else $nearest = findNearestForeign($db, $lat, $lng);
 
        if($nearest) {
          $insertnearest = insertNearest($db, $id, $nearest);
          if(!is_numeric($insertnearest)) {
            echo 'Error: MEL' . $id . ' could not be inserted!' . "\n";
            $log[] = 'Error: MEL' . $id . ' could not be inserted!';
            $inserterrors[] = $insertnearest;
          }
          else {
            // if the record is updated rather than inserted the number of 
            // affected rows equals 2
            if($insertnearest == 1) $ins++;
            else $upd++;
          }
        }
 
        $n++;
        if ($n % $mod == 0) {
          $t = time() - $start; // time in seconds the script has run
          $h = floor($t/3600); // hours
          $m = floor(($t-($h*3600))/60); // minutes
          $s = ($t-($m*60)-($h*3600)); // seconds
          $tstr = sprintf('%02d',$h) . ':';
          $tstr .= sprintf('%02d',$m) . ':';
          $tstr .= sprintf('%02d',$s);
          $tstr .= ' (' . $t . ' seconds)';
 
          echo "$n records processed ($ins inserted; $upd updated) in $tstr\n";
          $log[] = "$n records processed ($ins inserted; $upd updated) in $tstr";
        }
      }
      echo "$ins records were inserted in abcd_nearnamedplace\n";
      echo "$upd records were updated in abcd_nearnamedplace\n";
      $log[] = "A total of $ins records were inserted in abcd_nearnamedplace";
      echo count($inserterrors) . ' records could not be inserted; '
        . 'they have been written to \'inserterrors.txt\'' . "\n";
      $log[] = count($inserterrors) . ' records could not be inserted; '
        . 'they have been written to \'inserterrors.txt\'';
 
      $errorfile = fopen('/home/melisr/reversegeo/bin/inserterrors.txt', 'a');
      foreach($inserterrors as $err) {
        fwrite($errorfile, implode("\t", $err) . "\n");
      }
      fclose($errorfile);
    }
 
    // after full data upload we need to put the indexes back in
    // (insert queries run faster when there are no indexes (I think))
    if($datefrom == '0000-00-00') createIndexes($db);
 
    echo 'Script execution completed ' . date("Y-m-d H:i:s") . "\n";
    $log[] = 'Script execution completed ' . date("Y-m-d H:i:s");
 
    $logfile = fopen('/home/melisr/reversegeo/bin/nearnamedplace.log', 'a');
    fwrite($logfile, "\n\n<<<");
    fwrite($logfile, implode("\n", $log));
    fwrite($logfile, '>>>');
    fclose($logfile);
 
    $sql->free();
    $db->close();
  }
 
  runReverse($argv);
?>