Adding value to a data feed using YQL Execute

Filed under feature, tutorial

I want USGS earthquake data.  More specifically, I’m interested in recent, substantial quakes.  Fortunately, data.gov makes this data easy to find.  After searching through the USGS raw data catalog for the text “earthquake”, I choose the Worldwide M2.5+ Earthquakes, Past 7 Days feed, and pull it into YQL for parsing.  It’s almost perfect, but I want easy access to each quake’s magnitude, and the magnitude is buried in the “title” element.  No worries.  I’ll use YQL Execute to split it out and give it its own element in the feed’s structure.  I can then visualize this data using something like Jon LeBlanc’s js-yql-display project on GitHub.

Here are a few reasons why YQL is perfect for this task:
1) I can take advantage of Yahoo!’s web-serving infrastructure to fetch, process, and cache the feed, reducing my server’s exposure and bandwidth costs.  My table is also cached, further reducing bandwidth usage.

2) Because YQL Execute employs standard E4X, I am using and adding to my JavaScript skill set, instead of spending time learning a new language.

3) E4X was built specifically for XML manipulation, so it has a convenient syntax for this job.

4) By using YQL to do the heavy lifting, I can minimize the code I send to the browser and keep it focused on the display logic.

Ok. Ok. Here’s the code:

<?xml version="1.0" encoding="UTF-8"?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
  <meta>
        <description>Extracts magnitude from item title in atom feed and adds it as an element to the item.  We can then filter by magnitude using yql's built-in operators</description>
	<sampleQuery>select entry from usgs.earthquakes</sampleQuery>
	<sampleQuery>select entry from usgs.earthquakes where entry.magnitude >= 6.0</sampleQuery>

  </meta>
  <bindings>
    <select itemPath="" produces="XML">
		<urls>

			<url>http://earthquake.usgs.gov/eqcenter/catalogs/7day-M2.5.xml?11d</url>
		</urls>
		<execute><![CDATA[

			default xml namespace = "http://www.w3.org/2005/Atom";
			var xml = request.get().response,//call the url defined above
				entries = <entries></entries>,//prep the output object
				entry = null,//individual entry in xml obj. used in loop below.
				magnitude = null;//magnitude of quake.  used in loop below

			for each(entry in xml.entry){
				magnitude =
					entry.title//eg M 3.0, Puerto Rico region
					.split(' ')[1]//eg --> 3.0,
					.replace(',', '');//eg --> 3.0

				entry.appendChild( <magnitude>{magnitude}</magnitude> );
				entries.appendChild(entry);
			}
			response.object = entries;
		]]></execute>

    </select>
  </bindings>
</table>
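
The title parsing is the only fragile part of this table, so it helps to try it outside YQL first.  Here is a minimal plain-JavaScript sketch of the same split-and-replace step.  The function name extractMagnitude is my own, and it assumes titles always look like “M 3.0, Puerto Rico region”; unlike the table code, it also converts the result to a number and returns null when the title has a different shape.

```javascript
// Plain-JavaScript sketch of the title parsing used in the table above.
// Assumption: titles look like "M 3.0, Puerto Rico region".
function extractMagnitude(title) {
  var token = title.split(' ')[1];   // e.g. "3.0,"
  if (token === undefined) {
    return null;                     // title did not have the expected shape
  }
  var magnitude = parseFloat(token.replace(',', '')); // e.g. 3.0
  return isNaN(magnitude) ? null : magnitude;
}

console.log(extractMagnitude('M 3.0, Puerto Rico region')); // 3
console.log(extractMagnitude('Unexpected'));                 // null
```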

Now we can put this table on a server, load it into YQL with a use statement, and filter on the magnitude like any other element in the feed.
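
As a rough illustration of what “loading it up” looks like from client code, the sketch below builds a request URL for the second sample query.  The table location (example.com) is hypothetical, the function name buildYqlUrl is mine, and I am assuming the public YQL endpoint at query.yahooapis.com/v1/public/yql with its q and format parameters; adjust all three to your own setup.

```javascript
// Hypothetical location of the table definition; substitute your own host.
var TABLE_URL = 'http://example.com/usgs.earthquakes.xml';

// Assumed public YQL endpoint (not stated in this post).
var YQL_ENDPOINT = 'http://query.yahooapis.com/v1/public/yql';

// Builds a request URL that loads the table and filters by magnitude.
function buildYqlUrl(minMagnitude) {
  var yql = 'use "' + TABLE_URL + '" as usgs.earthquakes; '
          + 'select entry from usgs.earthquakes '
          + 'where entry.magnitude >= ' + minMagnitude;
  return YQL_ENDPOINT
       + '?q=' + encodeURIComponent(yql)
       + '&format=json';
}

console.log(buildYqlUrl(6));
```

Encoding the whole statement with encodeURIComponent keeps the quotes, spaces, and the >= operator intact in the query string.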

For those unfamiliar with E4X, it’s worth noting the namespace declaration (default xml namespace = "http://www.w3.org/2005/Atom";).  It tells YQL’s JavaScript engine which namespace unqualified element names belong to.  We wouldn’t be able to access the feed’s elements without it.  Find the namespaces associated with your data by looking at the xmlns attributes on the feed’s root element.  The Atom namespace governs my feed’s structure as a whole, which is why it was convenient to declare it as the default.  For access to specific elements that use another namespace, e.g. georss data, it’s easier to define the namespace locally like this:
var ns = Namespace("http://www.georss.org/georss");
and then use it like this:
var latitude = xml.ns::Result.ns::Latitude;
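
If a feed declares several namespaces, it can be handy to list them before deciding which one to make the default.  The plain-JavaScript sketch below pulls the xmlns declarations out of an opening tag with a regular expression.  The helper name listNamespaces and the sample root element are illustrative, not copied from the live feed, and the pattern only handles double-quoted attribute values.

```javascript
// Lists the xmlns declarations found in an opening tag.
// The default namespace is stored under the key "(default)".
function listNamespaces(openingTag) {
  var pattern = /xmlns(?::([\w.-]+))?="([^"]*)"/g,
      namespaces = {},
      match;
  while ((match = pattern.exec(openingTag)) !== null) {
    namespaces[match[1] || '(default)'] = match[2];
  }
  return namespaces;
}

// Illustrative root element (an assumption, not taken from the feed).
var root = '<feed xmlns="http://www.w3.org/2005/Atom" '
         + 'xmlns:georss="http://www.georss.org/georss">';

console.log(listNamespaces(root));
```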

Since we’ve gone to the trouble of defining a YQL table, we may as well add parsing for the summary element, which also contains some useful information in an inconvenient format.  Because this content is a bit more extensive, while still being somewhat predictable, a regular expression works well.  Here’s the code:

<?xml version="1.0" encoding="UTF-8"?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
  <meta>

	<description>Extracts magnitude from item title in atom feed and adds it as an element to the item.  We can then filter by magnitude using yql's built-in operators.  Additionally, it extracts summary cdata, parses it, wraps the parsed data in its own element, and adds this element to the xml output.  </description>
	<sampleQuery>select entry from usgs.earthquakes</sampleQuery>

	<sampleQuery>select entry.title, entry.updated, entry.link from usgs.earthquakes</sampleQuery>

	<sampleQuery>select entry.summary from usgs.earthquakes where entry.summary.type = "xml" and entry.summary.depth.km > 99</sampleQuery>

  </meta>
  <bindings>
    <select itemPath="" produces="XML">
		<urls>

			<url>http://earthquake.usgs.gov/eqcenter/catalogs/7day-M2.5.xml?11d</url>
		</urls>
		<execute><![CDATA[

			default xml namespace = "http://www.w3.org/2005/Atom";
			
			var xml = request.get().response,//call the url defined above

			 	entries = <entries></entries>,//prep the output object 
				entry = null,//individual entry in xml obj. used in loop below.
				magnitude = null,//magnitude of quake.  used in loop below

				re = '<img '//img tag opening bracket (note: trailing spaces here and below)
					+ 'src="(http://earthquake\\.usgs\\.gov/images/globes/[\\d_-]+\\.jpg)" '//img src - capture
					+ 'alt="([\\d\\.]+&#176;(?:N|S) [\\d\\.]+&#176;(?:W|E))" '//img alt - ignore (we already have coords from georss)

					+ 'align="(left|right)" '//img align - ignore
					+ 'hspace="(\\d+)" '//img hspace - ignore
					+ '/>'//img tag closing bracket
					+ '<p>'//opening p tag

					+ '(\\w+, \\w+\\s+\\d+, \\d+ [\\d:]+) UTC'//utc date - capture (note: variable amt of whitespace btwn month and day)
					+ '<br>'//br tag
					+ '(\\w+, \\w+\\s+\\d+, \\d+ [\\d:]+ (?:AM|PM)) at epicenter'//local date at epicenter - capture

					+ '</p>'//closing p tag
					+ '<p>'//opening p tag
					+ '<strong>Depth</strong>: '//descriptive text w/ strong tags

					+ '([\\d\\.]+) km '//depth in kilometers - capture
					+ '\\(([\\d\\.]+) mi\\)'//depth in miles (enclosed in parenthesis) - capture
					+ '</p>',//closing p tag

				cdata = null,
				summary = null;
			
			for each(entry in xml.entry){

				magnitude = 
					entry.title//eg M 3.0, Puerto Rico region
					.split(' ')[1]//eg --> 3.0,

					.replace(',', '');//eg --> 3.0
				entry.appendChild( <magnitude>{magnitude}</magnitude> );

				
				cdata = new RegExp(re).exec(entry.summary);//null if the html doesn't match the expected layout

				if (cdata) {//only add the parsed summary when the match succeeded
					summary = <summary type="xml"><!-- differentiate this summary obj from native summary obj w/ type 'html' -->
						<img alt={cdata[2]} align={cdata[3]} hspace={cdata[4]} src={cdata[1]} />
						<date>
							<utc>{cdata[5]}</utc>
							<local>{cdata[6]}</local>
						</date>
						<depth>
							<km>{cdata[7]}</km>
							<mi>{cdata[8]}</mi>
						</depth>
					</summary>;

					entry.appendChild(summary);
				}

				entries.appendChild(entry);

				
			}
			response.object = entries;
		]]></execute>

    </select>
  </bindings>
</table>

Now we’re talking!  Check it out in the console.
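
A pattern this long is easier to debug outside YQL.  The sketch below assembles the same regular expression in plain JavaScript and runs it against a sample summary.  The sample string is made up to match the layout the pattern expects, and parseSummary and its property names are my own; it returns null when the html does not match.

```javascript
// Same pattern as in the table, assembled from string pieces.
var SUMMARY_PATTERN = new RegExp(
    '<img '
  + 'src="(http://earthquake\\.usgs\\.gov/images/globes/[\\d_-]+\\.jpg)" '
  + 'alt="([\\d\\.]+&#176;(?:N|S) [\\d\\.]+&#176;(?:W|E))" '
  + 'align="(left|right)" '
  + 'hspace="(\\d+)" '
  + '/>'
  + '<p>(\\w+, \\w+\\s+\\d+, \\d+ [\\d:]+) UTC'
  + '<br>(\\w+, \\w+\\s+\\d+, \\d+ [\\d:]+ (?:AM|PM)) at epicenter</p>'
  + '<p><strong>Depth</strong>: ([\\d\\.]+) km \\(([\\d\\.]+) mi\\)</p>'
);

// Returns the captured fields, or null if the html does not match.
function parseSummary(html) {
  var m = SUMMARY_PATTERN.exec(html);
  if (m === null) {
    return null;
  }
  return {
    imgSrc: m[1], coords: m[2],
    utc: m[5], local: m[6],
    depthKm: m[7], depthMi: m[8]
  };
}

// Made-up sample in the expected layout (values are illustrative).
var sample = '<img src="http://earthquake.usgs.gov/images/globes/20_-65.jpg" '
           + 'alt="19.192&#176;N 65.000&#176;W" align="left" hspace="20" />'
           + '<p>Monday, May  3, 2010 12:34:56 UTC'
           + '<br>Monday, May  3, 2010 08:34:56 AM at epicenter</p>'
           + '<p><strong>Depth</strong>: 35.00 km (21.75 mi)</p>';

console.log(parseSummary(sample));
console.log(parseSummary('<p>something else</p>')); // null
```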

Here are a couple of implementation-level notes:
1) this code will generate an additional summary object, i.e., it doesn’t replace the pre-existing one.  If the latter behavior is preferred, replace
entry.appendChild(summary);
with
entry.summary = summary;

2) the regular expression syntax used above is just the standard syntax for JavaScript, but be aware that the html is rendered using html entities, so the content I’m parsing using the regular expression looks different in the YQL console.  For example, add this as the first line inside the for loop:
y.log(entry.summary); 
This will print the cdata-wrapped html to the diagnostics section of the YQL output.  Instead of “<img src="http://earthquake…-65.jpg" alt="19.192&#176;N …”, as we see in the raw xml feed, it looks like “&lt;img src="http://earthquake…-65.jpg" alt="19.192&amp;#176;N …” in the diagnostics.  On the server, it actually is the raw html, so the regular expression must be constructed accordingly.
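
To see the difference concretely, here is a small plain-JavaScript sketch that turns the escaped form shown in the diagnostics back into the raw form the regular expression actually runs against.  The helper name decodeEntities is my own and it only handles the four entities relevant here; it is meant for checking a pattern by hand, not as a general html decoder.

```javascript
// Converts the escaped text shown in the diagnostics back to raw html.
function decodeEntities(escaped) {
  return escaped
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&'); // last, so "&amp;lt;" becomes "&lt;", not "<"
}

var asLogged = '&lt;img alt="19.192&amp;#176;N"';
console.log(decodeEntities(asLogged)); // <img alt="19.192&#176;N"
```

Note that the degree sign stays as the literal text &#176; after decoding, which is why the pattern above matches &#176; rather than the ° character.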

To conclude, this post presents a couple of ways to restructure a USGS data feed using YQL Execute so it’s more convenient to consume.  I’ve also given a couple of tips for working with E4X and YQL.  Because YQL does the fetching, processing, and caching for me, my data delivery is speedy and my client-side code is light.