This repository has been archived by the owner on May 18, 2023. It is now read-only.

Updating README to include an example which is up-to-date and will work on a vanilla Spark cluster outside of the Databricks environment [skip ci] Fixes #12
Ghnuberath committed Oct 8, 2015
1 parent 162b46b commit 634a84b
Showing 2 changed files with 71 additions and 34 deletions.
2 changes: 2 additions & 0 deletions .gitignore
target/
.idea/
*.iml
.vagrant
metastore_db/
*.log
103 changes: 69 additions & 34 deletions README.md
# Mosaic
> Smaller tiles.
## Usage

### Source data

We need something to tile. Let's start with a small sample of the [NYC Taxi Dataset](http://www.andresmh.com/nyctaxitrips/), which can be [downloaded from here](http://assets.oculusinfo.com/pantera/taxi_micro.csv).

### Tile generation, made easy!

Let's generate tiles which represent the mean number of passengers at each pickup location in the source dataset.

To begin, we'll need a spark-shell. If you have your own Spark cluster, skip ahead to [Generation](#example-generation). Otherwise, continue to the next step to fire up a small Spark test cluster via [Docker](https://www.docker.com/). You'll want at least 4GB of free RAM on your machine to use this latter method.

#### Using the Docker test container

Since running Mosaic requires a Spark cluster, a containerized test environment is included via [Docker](https://www.docker.com/). If you have docker installed, you can run the following example within that containerized environment.

Build and fire up the container with a shell:

```bash
$ docker build -t docker.uncharted.software/mosaic-test .
$ docker run -v $(pwd):/opt/mosaic -it docker.uncharted.software/mosaic-test bash
```


Now, inside the container, build and install Mosaic:

```bash
$ ./gradlew install
```

Keep the container running! We'll need it to try the following example.

#### <a name="example-generation"></a>Generation

Launch a spark-shell. We'll be using mosaic and a popular CSV-to-DataFrame library for this example:

```bash
$ spark-shell --packages "com.databricks:spark-csv_2.10:1.2.0,com.unchartedsoftware.mosaic:mosaic-core:0.11.0"
```

Now it's time to run a simple tiling job! Enter paste mode (`:paste`), and paste the following script:

```scala
import com.unchartedsoftware.mosaic.core.projection.numeric._
import java.sql.Timestamp
import org.apache.spark.sql.Row

// source RDD
// It is STRONGLY recommended that you filter your input RDD
// down to only the columns you need for tiling.
val rdd = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("file:///taxi_micro.csv") // be sure to update the file path to reflect
// the download location of taxi_micro.csv
.select("pickup_lon", "pickup_lat", "passengers")
.rdd

// cache the RDD to make things a bit faster
rdd.cache

// We use a value extractor function to retrieve data-space coordinates from rows
val cExtractor = (r: Row) => {
if (r.isNullAt(0) || r.isNullAt(1)) {
None
} else {
Some((r.getDouble(0), r.getDouble(1)))
}
}

val result = gen.generate(rdd, Seq(series), request)
result.map(t => (t(0).coords, t(0).bins)).collect
```
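
Conceptually, the core of the job above is: project each point into a bin, then fold an aggregator over each bin. Here is a self-contained sketch of that bin-and-mean step using plain Scala collections in place of the RDD (the data and grid size are invented for illustration; this is not Mosaic's API):

```scala
// Bin points by truncation into a 4x4 grid over [0,1) x [0,1)
// and take the mean of the third field (passengers) per bin.
val points = Seq((0.10, 0.10, 2.0), (0.12, 0.11, 4.0), (0.80, 0.80, 1.0))
val bins = 4
val meansByBin: Map[(Int, Int), Double] =
  points
    .groupBy { case (x, y, _) => ((x * bins).toInt, (y * bins).toInt) }
    .map { case (bin, pts) => bin -> pts.map(_._3).sum / pts.size }

// meansByBin((0, 0)) is 3.0 (two points share that bin);
// meansByBin((3, 3)) is 1.0
```

The real generator does the same fold per bin, but distributed across the cluster and once per requested tile.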

## Mosaic Library Contents

Mosaic is made of some simple, but vital, components:

### Projections

A projection maps from data space to the tile coordinate space.

Mosaic currently supports three projections:
* CartesianProjection (x, y, v)
* MercatorProjection (x, y, v)
* SeriesProjection (x, v)
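
To make the data-space-to-tile mapping concrete, here is a minimal self-contained sketch of the idea behind a cartesian projection (the bounds, bin count, and names are invented for illustration; Mosaic's actual Projection API differs):

```scala
// Map a data-space point to (level, tileX, tileY, binX, binY).
object CartesianSketch {
  // assumed data-space bounds, e.g. lon/lat extents
  val (minX, minY, maxX, maxY) = (-180.0, -90.0, 180.0, 90.0)
  val binsPerTile = 256

  def project(x: Double, y: Double, level: Int): (Int, Int, Int, Int, Int) = {
    val tiles = 1 << level                      // 2^level tiles per axis
    val fx = (x - minX) / (maxX - minX)         // normalize to [0, 1]
    val fy = (y - minY) / (maxY - minY)
    val tileX = math.min(tiles - 1, (fx * tiles).toInt)
    val tileY = math.min(tiles - 1, (fy * tiles).toInt)
    val binX = math.min(binsPerTile - 1, ((fx * tiles - tileX) * binsPerTile).toInt)
    val binY = math.min(binsPerTile - 1, ((fy * tiles - tileY) * binsPerTile).toInt)
    (level, tileX, tileY, binX, binY)
  }
}

// e.g. CartesianSketch.project(0.0, 0.0, 1) == (1, 1, 1, 0, 0)
```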

### Aggregators

Aggregators are used to aggregate values within a bin or a tile.

Mosaic includes seven sample aggregators:


Additional aggregators can be implemented on-the-fly within your script as you see fit.
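
A custom aggregator boils down to a zero value, a fold step, a merge step for combining partial results across partitions, and a finisher. The trait below is a hypothetical stand-in for Mosaic's aggregator interface, purely to illustrate the shape such an implementation takes:

```scala
// Hypothetical aggregator shape (not Mosaic's actual trait).
trait AggregatorSketch[I, N, O] {
  def default: N                             // zero value
  def add(current: N, next: Option[I]): N    // fold one input in
  def merge(left: N, right: N): N            // combine partial results
  def finish(intermediate: N): O             // produce the final bin value
}

// Mean of Double values: the intermediate state is (count, sum).
object MeanSketch extends AggregatorSketch[Double, (Long, Double), Double] {
  def default = (0L, 0.0)
  def add(current: (Long, Double), next: Option[Double]) = next match {
    case Some(v) => (current._1 + 1, current._2 + v)
    case None    => current
  }
  def merge(left: (Long, Double), right: (Long, Double)) =
    (left._1 + right._1, left._2 + right._2)
  def finish(intermediate: (Long, Double)) =
    if (intermediate._1 == 0) 0.0 else intermediate._2 / intermediate._1
}
```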

### Requests

Mosaic allows tile batches to be phrased in several ways:

* TileSeqRequest (built from a Seq[TC] of tile coordinates, requesting specific tiles)
* TileLevelRequest (built from a Seq[Int] of levels, requesting all tiles at those levels)
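
A level-based request simply expands into every tile coordinate at the requested levels. Assuming square XY tile coordinates (a sketch of the idea, not Mosaic's request classes):

```scala
// At level z there are 2^z x 2^z tiles, so a level request for
// levels 0, 1 and 2 expands to 1 + 4 + 16 = 21 coordinates.
def tilesForLevels(levels: Seq[Int]): Seq[(Int, Int, Int)] =
  for {
    z <- levels
    n = 1 << z
    x <- 0 until n
    y <- 0 until n
  } yield (z, x, y)
```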

### Series

A Series pairs together a Projection with Aggregators. Multiple Series can be generated simultaneously, each operating on the source data in tandem.

### Serialization

Mosaic currently supports serializing tiles consisting of basic type values to Apache Avro, fully compliant with the aperture-tiles sparse/dense schemas. This functionality is provided in a separate package, mosaic-avro-serializer.

## Testing

Since testing Mosaic requires a Spark cluster, a containerized test environment is included via [Docker](https://www.docker.com/). If you have docker installed, you can build and test Mosaic within that environment:

```bash
$ docker build -t docker.uncharted.software/mosaic-test .
$ docker run --rm docker.uncharted.software/mosaic-test
```

The above commands trigger a one-off build and test of Mosaic. If you want to interactively test Mosaic while developing (without having to re-run the container), use the following commands:

```bash
$ docker build -t docker.uncharted.software/mosaic-test .
$ docker run -v $(pwd):/opt/mosaic -it docker.uncharted.software/mosaic-test bash
# then, inside the running container
$ ./gradlew
```

This will mount the code directory into the container as a volume, allowing you to make code changes on your host machine and test them on-the-fly.
