Data is increasingly one of the most valuable assets of any enterprise, so it is no surprise that controlling access to that data is a top priority. Given the importance of data management and the long legacy of systems supporting multiple authentication standards, many people are shocked to learn that Hadoop ships with no data security whatsoever out of the box. This article describes the basic concepts behind securing a Hadoop cluster with Kerberos and designing applications to interact with it.
It may seem counterintuitive to spend so much time building out infrastructure that surrounds Hadoop, but it is estimated that up to 80% of the average data scientist’s time is spent on activities other than modeling. Any tools that help reduce that time will make data scientists more effective and will have a direct impact on the time to value of new Hadoop environments.
Many companies have realized the value of applying machine learning techniques to the data they've already collected in a batch-oriented way, but many are still struggling to apply the same concepts to data as it's being generated in real time. Scoring data in real time is useful for any use case where time-sensitive decisions matter, such as fraud detection or clickstream analysis. This article describes a technique for using Alpine and Openscoring to build a model, deploy it, and programmatically use it to score new data as it is generated.
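As a sketch of the "score programmatically" step: Openscoring exposes deployed models over a REST API that accepts a JSON body with a record id and an `arguments` map of input fields. The field names, record id, and endpoint URL below are illustrative assumptions, not taken from any real model.

```python
import json

# Hypothetical transaction record to score; field names are assumptions
# for illustration only.
record = {"amount": 2500.0, "merchant_category": "electronics", "hour_of_day": 2}

def build_evaluation_request(record_id, arguments):
    """Build the JSON body for an Openscoring evaluation call.

    A sketch based on Openscoring's documented request format: an "id"
    identifying the record plus an "arguments" map of model input values.
    Verify the exact shape against the Openscoring version you deploy.
    """
    return {"id": record_id, "arguments": arguments}

payload = build_evaluation_request("txn-001", record)
body = json.dumps(payload)

# To actually score, POST the body to the deployed model's endpoint,
# e.g. with the requests library (URL and model name are assumptions):
#   requests.post("http://localhost:8080/openscoring/model/fraud",
#                 data=body, headers={"Content-Type": "application/json"})
```

The response would carry the model's output fields (for example, a predicted class and probabilities), which the calling application can act on as each new record arrives.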
Alpine ships with a wide variety of operators for all phases of the analytics workflow: data loading, exploration, transformation, sampling, and modeling. All told there are over eighty built-in operators available out of the box. Even though this seems like a lot, users will always encounter edge cases where a certain operator doesn't quite work the way they want. Or maybe they have a novel approach to improve the performance or accuracy of an existing statistical model. Alpine provides a robust plugin framework so users can create their own operators to address these types of scenarios.
Although sentiment analysis has become more popular in recent years, it has yet to become an easy problem to solve. Text must be cleansed, parsed, and analyzed before a statistical model can be developed that is capable of automatically determining whether the writer was expressing a positive or negative opinion about a particular subject. The complexity of these tasks has proved daunting to most companies, but this article describes an approach for using Alpine, Greenplum, and GPText to easily create a sentiment analysis model.
Companies trying to enable big data science know that their strategy needs to include both a real-time memory grid and a long-term analytics platform. Many products on the market today fill these individual niches, but efficiently rolling data from the real-time system to the analytics platform is often a pain point that companies struggle with. This guide describes a method for moving data from SQLFire to Greenplum in the most reliable and efficient way possible.
New data storage formats are frequently being designed and developed. Greenplum allows developers to create Custom Data Formatters to read or write data in the storage format of their choosing. One example is the Greenplum JSON Formatter, which makes it easy to work with JSON data in the database without relying on custom data conversion scripts.
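To make the idea concrete, a custom formatter is typically attached to a readable external table via Greenplum's `FORMAT 'CUSTOM'` clause. The sketch below is hypothetical: the table name, gpfdist host, and the formatter function name all depend on how the JSON formatter was built and installed in your cluster.

```sql
-- Hypothetical sketch: read newline-delimited JSON files through a
-- custom formatter function. The formatter function name and location
-- are assumptions; check the names used by your installed extension.
CREATE EXTERNAL TABLE events_json (payload json)
LOCATION ('gpfdist://etlhost:8081/events/*.json')
FORMAT 'CUSTOM' (formatter = json_formatter_import);
```

Once the external table is defined, the JSON data can be queried or loaded with ordinary SQL, with no intermediate conversion scripts.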
Enabling Kerberos authentication in Greenplum is an easy way to provide database connectivity in your existing Single Sign-On infrastructure. It simplifies Greenplum configuration since passwords need not be managed by the database, and, when combined with LDAP synchronization, can greatly reduce DBA workloads. Further, applications and services in your organization can delegate their authentication to Greenplum to simplify application development.
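In practice, Kerberos authentication is switched on per connection rule in `pg_hba.conf` using the GSSAPI method. The fragment below is a sketch with hypothetical network range and realm values; adjust them to your environment.

```
# pg_hba.conf sketch (hypothetical address range and realm): authenticate
# connections with Kerberos via GSSAPI. include_realm=0 strips the realm
# so Kerberos principals map directly onto existing database role names.
host    all    all    10.0.0.0/8    gss    include_realm=0    krb_realm=EXAMPLE.COM
```

With a rule like this in place, users who already hold a Kerberos ticket connect without entering a database password.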
If you've ever tried moving data between database systems then you've probably run into a character encoding issue. Encoding errors usually pop up during data loading and are hard to understand because the data looks correct in your old system and looks correct on disk, but it just won't load. Frustration often leads to sacrificing data integrity in order to get most of the data into the system. Fortunately, with planning and a basic understanding of character encodings, it is possible to achieve a perfect data load every time.
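A minimal illustration of the failure mode: the same text produces different bytes under different encodings, and bytes written under one encoding will not necessarily decode under another. This is a generic Python sketch, not tied to any particular database.

```python
# Round-tripping text through mismatched encodings is the classic source
# of load errors: bytes written as Latin-1 will not decode as UTF-8.
text = "café"

utf8_bytes = text.encode("utf-8")      # 5 bytes: 'é' becomes 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # 4 bytes: 'é' is the single byte 0xE9

# Decoding Latin-1 bytes as UTF-8 fails, which is exactly the
# "looks fine on disk but won't load" symptom:
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# The fix is to declare the source encoding explicitly when loading:
recovered = latin1_bytes.decode("latin-1")
```

The practical takeaway is to identify the source system's actual encoding up front and declare it to the loader, rather than letting a default (often UTF-8) silently mismatch the data.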
Collecting data is only half of the big data battle. Much like a carpenter needs a hammer and nails, your data scientists need the correct tools to effectively distill value from your data. All too often companies make the mistake of only considering how to store their data, when in reality storage is only one component of the big data tool belt. Ingestion, provisioning, cataloging, and querying must also be considered.
Companies around the world have by and large bought into the benefits of Big Data. They are collecting and keeping more data than ever before, but most of the time all that data is not earning its keep. Organizations need to put their big data to work by building more efficient data workflows.
LDAP is a powerful way to centrally manage users and groups in an organization. Greenplum or PostgreSQL can be easily configured to take advantage of LDAP or Active Directory as an authentication source. Combined with an automated synchronization process, such as pg-ldap-sync, most user, group, and permission administration tasks no longer need to be handled by a DBA.
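As with Kerberos, LDAP authentication is enabled per connection rule in `pg_hba.conf`. The fragment below is a sketch using simple-bind mode; the server name, DN prefix/suffix, and address range are hypothetical placeholders for your directory's values.

```
# pg_hba.conf sketch (hypothetical server and DN values): pass database
# logins through to LDAP / Active Directory for password verification.
# The server binds as "uid=<username>,ou=people,dc=example,dc=com".
host    all    all    10.0.0.0/8    ldap    ldapserver=ldap.example.com    ldapprefix="uid="    ldapsuffix=",ou=people,dc=example,dc=com"
```

The database still needs matching roles to exist, which is where a synchronization tool such as pg-ldap-sync comes in: it keeps roles and group memberships aligned with the directory so the DBA does not manage them by hand.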