Hadoop Kerberos Guide

July, 2015

Data is increasingly becoming one of the most valuable assets of any enterprise, so it is no surprise that controlling access to that data is a top priority. Given the importance of data management and the long legacy of systems supporting multiple types of authentication standards, most people are shocked to find out that Hadoop does not come out of the box with any data security whatsoever. This article will describe the basic concepts behind securing a Hadoop cluster with Kerberos and designing applications to interact with it.

Hadoop
Kerberos
Security

Expanding Your Hadoop Ecosystem

April, 2015

It may seem counterintuitive to spend so much time building out infrastructure that surrounds Hadoop, but it is estimated that up to 80% of the average data scientist’s time is spent on activities other than modeling. Any tools that help reduce that time will make data scientists more effective and will have a direct impact on the time to value of new Hadoop environments.

Hadoop
Data Science
Big Data
Infrastructure

Real Time Analytics

May, 2014

Many companies have realized the value of applying machine learning techniques to the data they've already collected in a batch-oriented way, but a lot of them are still struggling to apply the same concepts to data as it's being generated in real time. Scoring data in real time is useful for any use case where making time-sensitive decisions is important, such as fraud detection or clickstream analysis. This article describes a technique for using Alpine and Openscoring to build a model, deploy it, and programmatically use it to score new data as it is generated.

Alpine
Data Science
PMML
Internet of Things

Custom Alpine Operators

February, 2014

Alpine ships with a wide variety of operators for all phases of the analytics workflow: data loading, exploration, transformation, sampling, and modeling. All told, there are over eighty built-in operators available out of the box. Even though this seems like a lot, users will always encounter edge cases where a certain operator doesn't quite work the way they want. Or maybe they have a novel approach to improve the performance or accuracy of an existing statistical model. Alpine provides a robust plugin framework so users can create their own operators to address these types of scenarios.

Alpine
Greenplum
Data Transformation

Alpine Sentiment Analysis

November, 2013

Although sentiment analysis has become more popular in recent years, it has yet to become an easy problem to solve. Text must be cleansed, parsed, and analyzed before a statistical model can be developed that is capable of automatically determining whether the writer was expressing a positive or negative opinion about a particular subject. The complexity of these tasks has proved daunting to most companies, but this article describes an approach for using Alpine, Greenplum, and GPText to easily create a sentiment analysis model.

Alpine
Greenplum
GPText
Data Science

Connecting SQLFire to Greenplum

August, 2013

Companies trying to Enable Big Data Science know that their strategy needs to include both a real-time memory grid and a long-term analytics platform. Many products on the market today fill these individual niches, but rolling data from the real-time system to the analytics platform efficiently is often a pain point. This guide will describe a method for moving data from SQLFire to Greenplum reliably and efficiently.

Greenplum
SQLFire
RabbitMQ

Greenplum JSON Guide

July, 2013

New data storage formats are frequently being designed and developed. Greenplum allows developers to create Custom Data Formatters to read or write data in the storage format of their choosing. One example is the Greenplum JSON Formatter, which makes it easy to work with JSON data in the database without relying on custom data conversion scripts.

Greenplum
JSON
External Tables

Greenplum Kerberos Guide

May, 2013

Enabling Kerberos authentication in Greenplum is an easy way to provide database connectivity in your existing Single Sign-On infrastructure. It simplifies Greenplum configuration since passwords need not be managed by the database, and when combined with LDAP synchronization can greatly reduce DBA workloads. Further, applications and services in your organization can delegate their authentication to Greenplum to simplify application development.

Greenplum
PostgreSQL
Kerberos

Character Encodings Demystified

March, 2013

If you've ever tried moving data between database systems then you've probably run into a character encoding issue. Encoding errors usually pop up during data loading and are hard to understand because the data looks correct in your old system and looks correct on disk, but it just won't load. Frustration often leads to sacrificing data integrity in order to get most of the data into the system. Fortunately, with planning and a basic understanding of character encodings, it is possible to achieve a perfect data load every time.

Data Encoding
Data Loading
Data Migration

Enabling Big Science

February, 2013

Collecting data is only half of the Big Science battle. Much like a carpenter needs a hammer and nails, your data scientists need the correct tools to effectively distill value from your data. All too often companies make the mistake of only considering how to store their data, when in reality storage is only one component of the big data tool belt. Ingestion, provisioning, cataloging, and querying must also be considered.

Big Data
Data Science

Putting Big Data to Work

February, 2013

Companies around the world have by and large bought into the benefits of Big Data. They are collecting and keeping more data than ever before, but most of the time all that Big Data is not earning its keep. Organizations need to put their big data to work by building more efficient Data Workflows.

Big Data
Data Workflow

Greenplum LDAP Guide

January, 2013

LDAP is a powerful way to centrally manage users and groups in an organization. Greenplum or PostgreSQL can be easily configured to take advantage of LDAP or Active Directory as an authentication source. Combined with an automated synchronization process, such as pg-ldap-sync, most user, group, and permission administration tasks no longer need to be handled by a DBA.

Greenplum
PostgreSQL
LDAP
pg-ldap-sync

Dillon Woods
