All questions: [hadoop]

54 questions

18
votes
8 answers
52751 views

Writing data to Hadoop

I need to write data into Hadoop (HDFS) from external sources like a Windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. In my browsing of the code I didn't see an API for doing this. I am hoping someone can show me th...
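Worth noting: the put command does not have to run on the namenode. Any machine with the Hadoop client installed and a configuration pointing at the cluster can write directly; the paths below are placeholders:

```
# From any box with the Hadoop client and fs.default.name pointing
# at the cluster (no need to stage files on the namenode first):
hadoop fs -put /local/path/data.txt /user/me/data.txt
```

Programmatically, the same operation is exposed through the Java FileSystem API (FileSystem.get(conf), then fs.create(path)).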

59
votes
7 answers
3142 views

Life without JOINs... understanding, and common practices

Lots of "BAW"s (big ass-websites) are using data storage and retrieval techniques that rely on huge tables with indexes, and queries that won't/can't use JOINs (BigTable, HQL, etc.) to deal with scalability and sharding databases. How does that work when you have lots and lo...

2
votes
3 answers
6256 views

Sorting the values before they are sent to the reducer

I'm thinking about building a small testing application in hadoop to get the hang of the system. The application I have in mind will be in the realm of doing statistics. I want to have "The 10 worst values for each key" from my reducer function (where I must assume the possibility a huge number ...
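A "top N per key" reducer can be prototyped without any Hadoop machinery: a bounded min-heap keeps memory constant even when a key has a huge number of values. A minimal Python sketch (the function name and N are my own choices):

```python
import heapq

def ten_worst(values, n=10):
    """Keep only the n largest ("worst") values seen for one key.

    Mimics what a reducer would do: stream over the values for a key
    while holding at most n of them in memory.
    """
    heap = []  # min-heap of the n largest values seen so far
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)  # evict the smallest survivor
    return sorted(heap, reverse=True)

print(ten_worst(range(1000)))
# → [999, 998, 997, 996, 995, 994, 993, 992, 991, 990]
```

The same logic drops into a reducer unchanged, since Hadoop hands the reducer each key's values as a stream.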

3
votes
2 answers
1165 views

Setting up a (Linux) Hadoop cluster

Do you need to set up a Linux cluster first in order to set up a Hadoop cluster?

8
votes
3 answers
7958 views

Get the task attempt ID for the currently running Hadoop task

The Task Side-Effect Files section of the Hadoop tutorial mentions using the "attemptid" of the task as a unique name. How do I get this attempt ID in my mapper or reducer?
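In the old Java API the attempt ID can be read from the job configuration as mapred.task.id (conf.get("mapred.task.id") inside configure()). For a streaming job, Hadoop exports jobconf properties to the task's environment with dots replaced by underscores; a sketch, where the fallback value is my own placeholder:

```python
import os

def task_attempt_id():
    """In Hadoop Streaming, job configuration properties appear in the
    task environment with dots replaced by underscores, so
    mapred.task.id shows up as $mapred_task_id."""
    return os.environ.get("mapred_task_id", "attempt_local_000000_0")

# Simulate what the streaming runtime would provide:
os.environ["mapred_task_id"] = "attempt_200908111325_0001_m_000003_0"
print(task_attempt_id())  # → attempt_200908111325_0001_m_000003_0
```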

5
votes
2 answers
1814 views

CloudStore vs. HDFS

Does anyone have any familiarity with both CloudStore and HDFS? I am interested to see how far CloudStore has been scaled and how heavily it has been used in production. CloudStore seems to be more full-featured than HDFS. When thinking about these two filesystems, what practical trad...

53
votes
3 answers
39956 views

Java vs Python on Hadoop

I am working on a project using Hadoop and it seems to natively incorporate Java and provide streaming support for Python. Is there a significant performance impact to choosing one over the other? I am early enough in the process where I can go either way if there is a significant performance...

8
votes
4 answers
6442 views

Advanced queries in HBase

Given the following HBase schema scenario (from the official FAQ)... How would you design an HBase table for a many-to-many association between two entities, for example Student and Course? I would define two tables: Student: student id student data (name, address, ...) course...

0
votes
1 answer
163 views

Look up values in a BDB for several files in parallel

What is the most efficient way to look up values in a BDB for several files in parallel? If I had a Perl script which did this for one file at a time, would forking/running the process in background with the ampersand in Linux work? How might Hadoop be used to solve this problem? Would threadin...
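Since the question mentions forking per file, a process pool is the closest Python analogue (threads would be limited by the GIL for CPU-bound work). A sketch in which a plain dict stands in for the Berkeley DB, and lookup_file is a hypothetical name; a real script would open the BDB inside each worker:

```python
from multiprocessing import Pool

# Stand-in for the BDB: a real script would open the Berkeley DB
# (e.g. via the bsddb3 module) inside each worker process instead.
DB = {"alpha": 1, "bravo": 2, "charlie": 3}

def lookup_file(keys):
    """Look up every key from one input file; one worker per file."""
    return [(k, DB.get(k)) for k in keys]

if __name__ == "__main__":
    files = [["alpha", "bravo"], ["charlie", "delta"]]  # keys per file
    with Pool(processes=2) as pool:
        results = pool.map(lookup_file, files)
    print(results)
```

Hadoop would buy you the same fan-out across machines rather than cores, which only pays off once the data outgrows one box.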

25
votes
5 answers
7907 views

Can OLAP be done in BigTable?

In the past I used to build WebAnalytics using OLAP cubes running on MySQL. Now an OLAP cube the way I used it is simply a large table (ok, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimen...

20
votes
4 answers
2587 views

Hadoop Distribution Differences

Can somebody outline the differences between the various Hadoop distributions available: Cloudera - http://www.cloudera.com/hadoop Yahoo - http://developer.yahoo.net/blogs/hadoop/ using the Apache Hadoop distro as a baseline. Is there a good reason to use one of these distrib...

14
votes
7 answers
17541 views

Wiping out DFS in Hadoop

How do I wipe out the DFS in Hadoop?
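For reference, the usual recipe with the classic (0.x-era) control scripts is roughly the following. It irreversibly destroys all DFS data; the affected directories come from dfs.name.dir / dfs.data.dir in the site configuration:

```
stop-dfs.sh                # stop the namenode and datanodes
hadoop namenode -format    # wipe the namespace (answer Y when prompted)
start-dfs.sh               # bring the empty DFS back up
```

If datanodes later refuse to join with a namespaceID mismatch, the contents of their dfs.data.dir directories usually have to be deleted by hand as well.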

5
votes
3 answers
10789 views

Splitting input into substrings in PIG (Hadoop)

Assume I have the following input in Pig: some And I would like to convert that into: s so som some I've not (yet) found a way to iterate over a chararray in Pig Latin. I have found the TOKENIZE function but that splits on word boundaries. So can "pig latin" do this or is this something that...
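The transformation itself (all leading substrings of a chararray) is trivial to express in a streaming script or UDF, which was the common workaround before Pig grew richer string functions. The helper below is my own sketch of that logic:

```python
def prefixes(word):
    """All leading substrings of a string: 'some' -> s, so, som, some."""
    return [word[:i] for i in range(1, len(word) + 1)]

print(prefixes("some"))  # → ['s', 'so', 'som', 'some']
```

Each input row would be piped through the script (Pig's STREAM operator), emitting one output row per prefix.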

5
votes
2 answers
7739 views

Using the Apache Mahout machine learning libraries

I've been working with the Apache Mahout machine learning libraries in my free time a bit over the past few weeks. I'm curious to hear about how others are using these libraries.

0
votes
1 answer
91 views

Distributing Video on a LAN to alternate Locations - Can the browser detect this?

I'm the administrator of a company intranet and I'd like to start producing videos. However, we have a very small bandwidth tunnel between our locations, and I'd like to avoid multiple users hogging it by streaming videos across it. I'd like to synchronize the files to servers at each of the locations...

1
vote
2 answers
982 views

Hadoop DFS Permission Error

2009/08/11 13:25:39 [INFO] - put: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=yskhoo, access=WRITE, inode="":bad-boy:supergroup:rwxr-xr-x Why do I keep getting this error when I try to put some files from my LFS to HDFS?
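For context, the log shows the target inode owned by bad-boy with mode rwxr-xr-x, so user yskhoo has no write access. Assuming the goal is simply to unblock the put (the HDFS path below is a placeholder, since the log elides it), the usual fixes look like this:

```
# Give the writing user ownership of the target directory...
hadoop fs -chown yskhoo /path/to/target
# ...or relax the permissions on it:
hadoop fs -chmod 775 /path/to/target
```

Setting dfs.permissions to false in the configuration disables the checks entirely, which is sometimes acceptable on private clusters.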

0
votes
2 answers
330 views

Hadoop DFS error when copying a file from local to HDFS

Can someone tell me what I am doing wrong? 2009/08/10 11:33:07 [INFO] - Copying local:/X/Y/Z.txt to DFS:/X/Y/Z.txt 2009/08/10 11:33:07 [INFO] - put: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=superman, access=WRITE, inode="":big-build:supergroup:rwxr-x...

2
votes
3 answers
1536 views

How can I use .PCAP (binary) input logs with Hadoop MapReduce?

Tcpdump logs are binary files. I want to know which Hadoop FileInputFormat I should use to split the input data into chunks. Please help me!
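The difficulty is that pcap records are variable-length, so the file cannot be split at arbitrary byte offsets the way line-oriented text can; the usual approach is a custom FileInputFormat whose isSplitable() returns false, paired with a RecordReader that walks the record headers. A Python sketch of that walk, assuming the common little-endian pcap layout (the fabricated capture bytes are my own):

```python
import struct

def read_packets(data):
    """Walk a classic pcap byte string: a 24-byte global header, then
    per packet a 16-byte record header (ts_sec, ts_usec, incl_len,
    orig_len) followed by incl_len bytes of packet data."""
    offset = 24  # skip the global header
    packets = []
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack_from(
            "<IIII", data, offset)
        offset += 16
        packets.append(data[offset:offset + incl_len])
        offset += incl_len
    return packets

# A tiny fabricated capture: global header + one 4-byte packet.
blob = b"\x00" * 24 + struct.pack("<IIII", 1, 2, 4, 4) + b"ABCD"
print(read_packets(blob))  # → [b'ABCD']
```

A Java RecordReader would do the same walk and emit each packet as one value to the mapper.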

0
votes
1 answer
1322 views

Hadoop Input Files

Is there a difference between having say n files with 1 line each in the input folder and having 1 file with n lines in the input folder when running hadoop? If there are n files, does the "InputFormat" just see it all as 1 continuous file?

13
votes
10 answers
9500 views

Streaming data and Hadoop? (not Hadoop Streaming)

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it a...

2
votes
5 answers
442 views

Dealing with Gigabytes of Data

I am going to start on a new project. I need to deal with hundreds of gigs of data in a .NET application. It is a very early stage to give much detail about this project. An overview follows: lots of writes and lots of reads on the same tables, very real-time; scaling is very important as the...

1
vote
1 answer
887 views

Java Generics & Hadoop: how to get a class variable

I'm a .NET programmer doing some Hadoop work in Java and I'm kind of lost here. In Hadoop I am trying to set up a Map-Reduce job where the output key of the Map job is of the type Tuple<IntWritable,Text>. When I set the output key using setOutputKeyClass as follows: JobConf conf2 = new JobCo...

110
votes
4 answers
61059 views

How does the MapReduce sort algorithm work?

One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment. To me sorting simply involves determining the relative position of an element in relat...
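The key idea behind TeraSort-style sorting is that MapReduce already sorts the keys arriving at each reducer, so global order only requires a partitioner that sends disjoint key ranges to successive reducers, with split points chosen by sampling the input. A small Python model of the partitioning step (names are mine):

```python
def range_partition(keys, split_points):
    """Assign each key to the partition whose range it falls in:
    partition i receives keys below split_points[i]. Each partition is
    then sorted independently (as each reducer does), and concatenating
    the partitions in order yields a globally sorted result."""
    parts = [[] for _ in range(len(split_points) + 1)]
    for k in keys:
        i = sum(1 for s in split_points if k >= s)
        parts[i].append(k)
    return [sorted(p) for p in parts]

parts = range_partition([9, 1, 7, 3, 5], split_points=[4, 8])
print(parts)           # → [[1, 3], [5, 7], [9]]
print(sum(parts, []))  # → [1, 3, 5, 7, 9], globally sorted
```

TeraSort's contribution is mostly in picking split points that balance the partitions, not in the sorting itself.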

1
vote
1 answer
2957 views

Copy ResultSet without using CachedRowSetImpl.execute()

I'm trying to close the connection after executing a query. Before, I would just create a CachedRowSetImpl instance and it would take care of releasing the resources for me. However, I am using the Hive database driver from the Hadoop project. It doesn't support CachedRowSetImpl.execute(). I'm wondering if is the...

-2
votes
5 answers
1434 views

Hadoop Hive question

I'm trying to create tables programmatically using JDBC. However, I can't see the table I created from the Hive shell. What's worse, when I access the Hive shell from different directories, I see different results for the database. Is there any setting I need to configure? Thanks in advance.

3
votes
2 answers
1532 views

Distributed HBase scanner

In the "Example API Usage" section of the HBase documentation's "Getting Started" page there is an example of using a scanner: Scanner scanner = table.getScanner(new String[]{"myColumnFamily:columnQualifier1"}); RowResult rowResult = scanner.next(); while (rowResult != null) { //... rowR...

3
votes
3 answers
2248 views

Processing files with headers in Hadoop

I want to process many files in Hadoop. Each file has some header information, followed by many records, each stored as a fixed number of bytes. Any suggestions?
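Once a task holds the whole file (for instance via an input format that does not split it), decoding is mechanical: skip the header, then slice fixed-width records. A Python sketch; the 16-byte header and the int-plus-float record layout are purely illustrative:

```python
import struct

RECORD = struct.Struct("<If")  # hypothetical record: an int and a float

def read_records(data, header_size=16):
    """Skip a fixed-size header, then decode fixed-width records,
    ignoring any ragged bytes at the end."""
    body = data[header_size:]
    usable = len(body) - len(body) % RECORD.size
    return [RECORD.unpack_from(body, i)
            for i in range(0, usable, RECORD.size)]

blob = b"H" * 16 + struct.pack("<If", 1, 2.0) + struct.pack("<If", 3, 4.0)
print(read_records(blob))  # → [(1, 2.0), (3, 4.0)]
```

In Hadoop proper, the same logic would live in a custom RecordReader, or in a streaming mapper fed whole files.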

2
votes
4 answers
4057 views

Getting data into and out of Hadoop

I need a system to analyze huge log files. A friend recommended Hadoop the other day, and it seems perfect for my needs. My question revolves around getting data into Hadoop: is it possible for the nodes in my cluster to stream data into HDFS as they receive it? Or...

9
votes
1 answer
2436 views

Is HBase stable and production-ready?

Do those of you who have deployed HBase on your clusters consider it stable enough for production use? What types of problems or issues have you run into? I do see a number of companies using HBase in production (http://wiki.apache.org/hadoop...

3
votes
2 answers
6394 views

How do I set the priority/pool for a Hadoop Streaming job?

How do I set the priority/pool of a Hadoop Streaming job? It is probably a jobconf command-line parameter (something like -jobconf something=pool.name), but I haven't managed to find any documentation for it anywhere on the Internet...
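For what it's worth, job priority is a standard jobconf property, and the pool (when the Fair Scheduler is enabled) has its own property; the jar path and pool name below are placeholders:

```
hadoop jar contrib/streaming/hadoop-streaming.jar \
    -jobconf mapred.job.priority=VERY_HIGH \
    -jobconf mapred.fairscheduler.pool=mypool \
    -input in/ -output out/ -mapper cat -reducer cat
```

Valid priorities are VERY_HIGH, HIGH, NORMAL, LOW and VERY_LOW; the pool property is ignored unless the Fair Scheduler is configured on the JobTracker.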