Changelog

Last Update: June 7, 2019

A log of the project's development history

1. June 2019: Geo Location Coordinates

The transformed data now contains geo location coordinates from the original ".GC" database field.

Download: v2019-06-01.b32a76a JSON

Indexed with ElasticSearch, the data can be queried by distance from a given coordinate.

A typical question could be:

"What sound clips were recorded in a 500 km radius of 23° north and 112° west?"

Using curl and jq, that question translates into the following ElasticSearch query.

curl -s -XGET 'https://marine-mammal.soundwave.cl/es/_search?size=1' -H 'Content-Type: application/json' -d'{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "500km", 
          "location.coordinates": { 
            "lat": 23,
            "lon": -112
          }
        }
      }
    }
  }
}' | jq '.hits.hits[]._source.location'

With the result:

{
  "name": [
    "Magdalena Bay, Baja, California"
  ],
  "coordinates": [
    {
      "lat": 24,
      "lon": -112
    }
  ]
}

This lays the technical foundation to create a World Map with dynamic filters.

23. May 2019: Extending documentation

Added three pages to the website:

17. May 2019: World Map

Most of the 15254 sound clips can now be explored on an interactive World Map.

13910 of the database records contain coordinates, with 224 unique values among them. The coordinates are given in whole degrees only, pointing to the wider region where the recording was made; more precise positions are not directly available.

Each dot on the World Map combines all recordings from that location. Clicking on a dot reveals an info window with basic record details and a player for the sound clip. The title bar contains a pager with arrows to navigate through the recordings; the pager can also be clicked to show a list of all recordings.

The map is implemented using the open source ArcGIS JavaScript library with a standard GeoJSON file as data source. The GeoJSON file is also available for download.

To create the GeoJSON file, run the following command in the source code tree.

./GeoJSON.jq data/rn/*json > wmmsdb.geojson
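
The exact feature layout is produced by GeoJSON.jq. As a rough sketch only (the property names and the record number are illustrative, not taken from the script), one aggregated location could look like this, with GeoJSON's longitude-first coordinate order:

{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [-112, 24]
  },
  "properties": {
    "name": "Magdalena Bay, Baja, California",
    "recordings": ["71004003"]
  }
}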

25. July 2018: Adding search and filter by location description

Example Queries (TODO)

24. July 2018: Analyzing Watkins sound clips for acoustic features

Enriching the Watkins sound database is one opportunity to be explored with this project. The contents of each sound clip are well described by the database, but nothing provides insight into the actual signal characteristics. Even simple properties like clip duration are not available.

Automatically analyzing ~15,000 sound clips might not have been an option with affordable PC hardware in the 1990s, but any current-day machine can handle this task in reasonable time.

Method

In January 1992, Kurt Fristrup and William Watkins published a report on a software tool, "Characterizing acoustic features of marine animal sounds".

The sound measures included statistics for Aggregate Bandwidth, Intensity, Duration, Amplitude Modulation, Frequency Modulation, Short-term Bandwidth, Center Frequency, and Amplitude Frequency Interaction.

The paper further asks:

Do the differences in these sound features remain distinctive as the scope of comparison widens? With our own ears, we can often distinguish acoustic features that appear to be species-specific, and sometimes features unique to individual animals; can we specify numerical algorithms that objectively recognize these distinctions?

The mentioned "Aggregate Bandwidth" or "aggregate power spectrum" function exists in the popular seewave library by Jérôme Sueur, written in the statistical computing language R.

The manual states:

acoustat was originally developed in Matlab language by Fristrup and Watkins (1992). The R function was kindly checked by Kurt Fristrup.

There are a few things to note about the default acoustat() parameters compared to the workflow described in the paper.

For now, the acoustat() parameters are left at their defaults.

Implementation

Downloading all sound clips is left as an exercise for the reader. Please be reasonable and don't overload the WHOI server.

The remaining job is fairly simple: load the signal, run the statistics, and store the results as JSON files for further indexing in ElasticSearch.

Running the analysis efficiently requires a task management tool that keeps track of progress, can resume an aborted run, and allows parallel execution of tasks. A bash script could run the tasks in parallel, but GNU Make also provides a clear view of how far the processing of all sound clips has progressed. It does that by keeping track of input and output files: if an output JSON file does not exist, the job is not done.

This simplified Makefile defines *.wav files as INPUTS and *.acoustat.json files as ACOUSTAT outputs, using acoustat.json.r as the job processor.

DIR = $(abspath .)
INPUTS = $(wildcard $(DIR)/*.wav)

ACOUSTAT = $(patsubst $(DIR)/%.wav,$(DIR)/%.acoustat.json,$(INPUTS))

acoustat: $(ACOUSTAT)

# note: the recipe line below must be indented with a tab character
$(DIR)/%.acoustat.json: $(DIR)/%.wav
    ./acoustat.json.r $< $@ $(*F)

acoustat.json.r is called with 3 parameters: the input WAV filename ($<), the output JSON filename ($@), and the stem ($(*F)), i.e. the basename of the input file. The last parameter equals the record number and is added as id in the output JSON file. This is needed to map the JSON data to the correct ElasticSearch document id.

The R script acoustat.json.r is straightforward to implement.

#!/usr/bin/Rscript --vanilla
suppressPackageStartupMessages({
library("seewave")
library("tuneR")
library("jsonlite")
library("methods")
})
# arguments: [1] input wav file, [2] output JSON file, [3] record number
argv = commandArgs(trailingOnly = TRUE)
wav = tuneR::readWave(argv[1])
stat = seewave::acoustat(wave=wav, plot = FALSE)
# remove unwanted contour data
stat$freq.contour <- NULL
stat$time.contour <- NULL
# assign record number as id
stat$id <- argv[3]
write_json(stat, argv[2])

The first line allows execution as a shell script, and the --vanilla flag ensures a clean R environment. To avoid cluttered output during execution, the package startup messages are suppressed.

Execution

All sound files, the Makefile, and the acoustat.json.r script are placed in the same directory, and the following command runs the analysis with 8 parallel processes. The number should match the number of available CPU cores.

make -j 8
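
If the number of cores is not known in advance, it can be determined at run time, assuming a Linux system with GNU coreutils:

make -j "$(nproc)"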

A single output JSON file looks like this, with P1 and P2 being the lower and upper estimates described in the Fristrup and Watkins (1992) report, chapter 2.6 "Aggregate Bandwidth".

{
  "time.P1": [
    0.1157
  ],
  "time.M": [
    1.0284
  ],
  "time.P2": [
    1.8254
  ],
  "time.IPR": [
    1.7098
  ],
  "freq.P1": [
    1.5625
  ],
  "freq.M": [
    11.25
  ],
  "freq.P2": [
    28.4375
  ],
  "freq.IPR": [
    26.875
  ],
  "id": [
    "71004003"
  ]
}

The meaning of each value is also documented in the acoustat manual.

Indexing

As a last step the raw values are mapped under .sound.freq and .sound.time of the existing JSON document tree.

An acoustat.jq script transforms the JSON data for the ElasticSearch bulk import API.

{ 
    update: {
        _index: "wmmsdb", 
        _type: "record", 
        _id: .id[0] 
    } 
},
{
    doc: {
        sound: {
            freq: {
                IPR: .["freq.IPR"][0],
                M: .["freq.M"][0],
                P1: .["freq.P1"][0],
                P2: .["freq.P2"][0]
            },
            time: {
                IPR: .["time.IPR"][0],
                M: .["time.M"][0],
                P1: .["time.P1"][0],
                P2: .["time.P2"][0]
            }
        }
    }
}
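
Applied to the sample record above, the script emits two NDJSON lines per document, as required by the bulk API:

{"update":{"_index":"wmmsdb","_type":"record","_id":"71004003"}}
{"doc":{"sound":{"freq":{"IPR":26.875,"M":11.25,"P1":1.5625,"P2":28.4375},"time":{"IPR":1.7098,"M":1.0284,"P1":0.1157,"P2":1.8254}}}}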

Finally the data is added to ElasticSearch using the following command.

jq --raw-output --compact-output -f acoustat.jq *.acoustat.json | curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@-" | jq .took
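
The bulk response also reports whether any individual update failed; replacing the final jq filter makes that visible along with the timing:

jq --raw-output --compact-output -f acoustat.jq *.acoustat.json | curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@-" | jq '{took, errors}'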

Results and Discussion

With small changes to the existing Web UI, the acoustic features are now available as search filters, but how can they be used during research?
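
As a simple illustration, recordings with a wide aggregate frequency spread can be selected with a range query on the newly indexed sound.freq.IPR field. The sketch below uses the search endpoint from the geo query above; the threshold of 20 is an arbitrary example value, not a recommendation.

curl -s -XGET 'https://marine-mammal.soundwave.cl/es/_search?size=1' -H 'Content-Type: application/json' -d'{
  "query": {
    "range": {
      "sound.freq.IPR": { "gte": 20 }
    }
  }
}' | jq '.hits.hits[]._source.sound'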

The 1992 Fristrup and Watkins report outlines the design of the features.

Each statistic was designed to emphasize particular parameters of animal sounds that we recognized as important for distinguishing species.

The paper further describes a correlation test with a subset of 200 sound clips, to see if species could be distinguished using the statistical features.

It notes:

The short-term bandwidth statistics in Table 5, the aggregate bandwidth statistics in Table 6, and the center frequency statistics of Table 7 were the most diagnostic for this set of sound sequences. They apparently separated the sounds of different species.

Due to the mentioned differences in parameters and workflow (and probably sample size), the values from Table 6 don't translate 1:1 to the current results. A correlation test over the full current result set might highlight the exact values that distinguish between species.

For now, the technical workflow for processing all sound clips efficiently is established. More refined parameters and methods are to be explored in the future.

References

Fristrup, K. M. and Watkins, W. A. 1992. Characterizing acoustic features of marine animal sounds. Woods Hole Oceanographic Institution Technical Report WHOI-92-04.

20. July 2018: First release

Notes on Implementation (TODO)

6. July 2018: Start of the project

(TODO)