Hadoop Study Notes

Part 1: Hadoop Study Notes

1. FileInputFormat splits only large files. Here "large" means larger than an HDFS block. The split size is normally the size of an HDFS block, which is appropriate for most applications; however, it is possible to control this value by setting various Hadoop properties.

2. The split size is computed as max(minimumSize, min(maximumSize, blockSize)). By default the minimum split size is smaller than the block size and the maximum split size is larger, so the split size is blockSize.

3. Making the minimum split size greater than the block size increases the split size, but at the cost of locality.

4. Hadoop works better with a small number of large files than with a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the files are very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of map tasks (one per file), each of which imposes extra bookkeeping overhead.

Hadoop handles large numbers of small data files poorly:

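Hadoop processes data in blocks, and by default one block is 64 MB. When there are many small data files (for example, files of 2-3 MB each), every file is far smaller than a block but is still handled as a block of its own.

This has two consequences:

1. Storing a large number of small files takes up storage resources (every file needs its own block and its own metadata entry), so storage is inefficient and retrieval is slower than for large files.

2. During MapReduce computation these small files waste computing capacity, because map tasks are assigned per block by default (this is probably the main drawback of using small files).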
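The effect of file size on the number of map tasks can be made concrete with a little arithmetic. The sketch below is a simplified model, not Hadoop code: the class and method names are hypothetical, the split size uses the max(minimumSize, min(maximumSize, blockSize)) rule, and the split count assumes that every non-empty file yields ceil(length / splitSize) splits (the real implementation allows a little slack on the last split).

```java
import java.util.Arrays;

// Hypothetical helper, not part of Hadoop: models how many splits (and therefore
// map tasks) a set of input files produces.
final class SplitCountSketch {
    static final long MB = 1024L * 1024L;

    // Split size rule: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified count: a split never spans two files, so each non-empty file
    // contributes ceil(length / splitSize) splits.
    static long countSplits(long[] fileLengths, long splitSize) {
        long splits = 0;
        for (long length : fileLengths) {
            if (length > 0) {
                splits += (length + splitSize - 1) / splitSize;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Default-like settings: tiny minimum, huge maximum, 64 MB blocks.
        long splitSize = computeSplitSize(1, Long.MAX_VALUE, 64 * MB);

        long[] oneLargeFile = { 1024 * MB };   // 1 GB stored as a single file
        long[] manySmallFiles = new long[512]; // 1 GB stored as 512 files of 2 MB
        Arrays.fill(manySmallFiles, 2 * MB);

        System.out.println("split size (MB): " + splitSize / MB);
        System.out.println("splits for one large file: " + countSplits(oneLargeFile, splitSize));
        System.out.println("splits for many small files: " + countSplits(manySmallFiles, splitSize));
    }
}
```

Under these assumptions the same gigabyte of data needs 16 map tasks as one file and 512 map tasks as 2 MB files, which is where the per-task bookkeeping overhead comes from.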

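So how can this problem be solved?

1. Use the HAR (Hadoop Archive) files that Hadoop provides; the Hadoop command manual describes the archive command for packing small files.

2. Preprocess the data yourself and store groups of small files as large files of more than 64 MB.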
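The second option, merging small files before they reach HDFS, comes down to deciding which files go into which merged file. A minimal sketch of one possible grouping strategy follows. It is an illustration only: the class name, the greedy rule, and the sizes are assumptions, and the actual concatenation of file contents is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decides how to group small files so that each merged
// file reaches at least the target size (for example, one 64 MB block).
final class SmallFileBatcher {
    // Walks the sizes in order and closes a batch as soon as its total reaches
    // targetBytes. Only the last batch may stay below the target.
    static List<List<Long>> batch(List<Long> fileSizes, long targetBytes) {
        List<List<Long>> batches = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            current.add(size);
            currentBytes += size;
            if (currentBytes >= targetBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            sizes.add(3 * mb); // one hundred files of 3 MB each
        }
        List<List<Long>> batches = batch(sizes, 64 * mb);
        System.out.println("merged files needed: " + batches.size());
    }
}
```

Each batch would then be written out as one large file, so the job sees a handful of block-sized inputs instead of a hundred tiny ones.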

FileInputFormat is the base class for all implementations of InputFormat that use files as their data source (see Figure 7-2). It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files. The job of dividing splits into records is performed by subclasses.

As a MapReduce application writer, you don't need to deal with InputSplits directly, as they are created by an InputFormat. An InputFormat is responsible for creating the input splits and dividing them into records. Before we see some concrete examples of InputFormat, let's briefly examine how it is used in MapReduce. Here's the interface:

    public interface InputFormat<K, V> {
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

        RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                           Reporter reporter) throws IOException;
    }

The JobClient calls the getSplits() method. On a tasktracker, the map task passes the split to the getRecordReader() method on InputFormat to obtain a RecordReader for that split.

A related requirement that sometimes crops up is for mappers to have access to the full contents of a file. Not splitting the file gets you part of the way there, but you also need to have a RecordReader that delivers the file contents as the value of the record.

Example 7-2. An InputFormat for reading a whole file as a record

    import java.io.IOException;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class WholeFileInputFormat
            extends FileInputFormat<NullWritable, BytesWritable> {

        // Never split: each file must reach a single mapper in one piece.
        @Override
        protected boolean isSplitable(FileSystem fs, Path filename) {
            return false;
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            return new WholeFileRecordReader((FileSplit) split, job);
        }
    }

We implement getRecordReader() to return a custom implementation of RecordReader.

Example 7-3. The RecordReader used by WholeFileInputFormat for reading a whole file as a record

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {

        private FileSplit fileSplit;
        private Configuration conf;
        private boolean processed = false; // true once the single record has been emitted

        public WholeFileRecordReader(FileSplit fileSplit, Configuration conf)
                throws IOException {
            this.fileSplit = fileSplit;
            this.conf = conf;
        }

        @Override
        public NullWritable createKey() {
            return NullWritable.get();
        }

        @Override
        public BytesWritable createValue() {
            return new BytesWritable();
        }

        @Override
        public long getPos() throws IOException {
            return processed ? fileSplit.getLength() : 0;
        }

        @Override
        public float getProgress() throws IOException {
            return processed ? 1.0f : 0.0f;
        }

        // Reads the whole file into the value on the first call; later calls return false.
        @Override
        public boolean next(NullWritable key, BytesWritable value) throws IOException {
            if (!processed) {
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(conf);
                FSDataInputStream in = null;
                try {
                    in = fs.open(file);
                    IOUtils.readFully(in, contents, 0, contents.length);
                    value.set(contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                processed = true;
                return true;
            }
            return false;
        }

        @Override
        public void close() throws IOException {
            // do nothing
        }
    }

Input splits are represented by the Java interface InputSplit (which, like all of the classes mentioned in this section, is in the org.apache.hadoop.mapred package):

    public interface InputSplit extends Writable {
        long getLength() throws IOException;
        String[] getLocations() throws IOException;
    }

An InputSplit has a length in bytes and a set of storage locations, which are just hostname strings. Notice that a split doesn't contain the input data; it is just a reference to the data. The storage locations are used by the MapReduce system to place map tasks as close to the split's data as possible, and the size is used to order the splits so that the largest get processed first, in an attempt to minimize the job runtime (this is an instance of a greedy approximation algorithm).
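The largest-first ordering of splits can be illustrated without any Hadoop classes. In the sketch below, SimpleSplit is a hypothetical stand-in for InputSplit that carries only a length and a list of hostnames; the sort is just one straightforward way to express the greedy ordering, not the code Hadoop itself uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SplitOrderingSketch {
    // Hypothetical stand-in for InputSplit: a reference to data, not the data itself.
    static final class SimpleSplit {
        final long length;        // size of the split in bytes
        final String[] locations; // hostnames that store the split's data

        SimpleSplit(long length, String... locations) {
            this.length = length;
            this.locations = locations;
        }
    }

    // Returns a copy of the splits ordered from largest to smallest, so the
    // longest-running work is started first.
    static List<SimpleSplit> largestFirst(List<SimpleSplit> splits) {
        List<SimpleSplit> ordered = new ArrayList<>(splits);
        ordered.sort(Comparator.comparingLong((SimpleSplit s) -> s.length).reversed());
        return ordered;
    }

    public static void main(String[] args) {
        List<SimpleSplit> splits = Arrays.asList(
                new SimpleSplit(10L, "host-a"),
                new SimpleSplit(30L, "host-b", "host-c"),
                new SimpleSplit(20L, "host-a"));
        for (SimpleSplit split : largestFirst(splits)) {
            System.out.println(split.length + " bytes on " + Arrays.toString(split.locations));
        }
    }
}
```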

Having calculated the splits, the client sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers.

A path may represent a file, a directory, or, by using a glob, a collection of files and directories. A path representing a directory includes all the files in the directory as input to the job. See "File patterns" for more on using globs.

It is a common requirement to process sets of files in a single operation. For example, a MapReduce job for log processing might analyze a month's worth of files, contained in a number of directories. Rather than having to enumerate each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation that is known as globbing. Hadoop provides two FileSystem methods for processing globs:

    public FileStatus[] globStatus(Path pathPattern) throws IOException
    public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
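To get a feel for globbing without a Hadoop cluster, the sketch below uses the JDK's own glob support (java.nio.file.PathMatcher) as a stand-in for globStatus(). This is an analogy only: the JDK glob syntax is close to, but not identical with, Hadoop's, and the log directory names are made up for the example.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that filters candidate paths with a JDK glob pattern.
final class GlobSketch {
    static List<String> match(String glob, List<String> candidates) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matched = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate))) {
                matched.add(candidate);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Made-up layout: logs/<year>/<month>/<day>
        List<String> dirs = Arrays.asList(
                "logs/2007/12/30", "logs/2007/12/31", "logs/2008/01/01");

        // One expression selects every day directory of 2007.
        System.out.println(match("logs/2007/*/*", dirs));
        // A character class selects both years.
        System.out.println(match("logs/200[78]/*/*", dirs));
    }
}
```

A single pattern replaces an explicit list of directories, which is the convenience the globStatus() methods offer for job input paths.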

Part 2: Hadoop JobTracker Analysis

After setting the job's configuration parameters, the client calls job.waitForCompletion(true), which submits the job to the JobTracker through submit() and waits for the job to complete.

    public void submit() throws IOException, InterruptedException,
            ClassNotFoundException {
        ensureState(JobState.DEFINE); // check the job state
        setUseNewAPI();               // check and set whether the new MapReduce API is used

        // Connect to the JobTracker and submit the job
        connect();                                // connect to the JobTracker
        info = jobClient.submitJobInternal(conf); // submit the job information
        super.setJobID(info.getID());
        state = JobState.RUNNING;                 // update the job state
    }

The code above has two main steps: connecting to the JobTracker and submitting the job information. The connect() method mainly instantiates a JobClient object, which includes setting the JobConf and running init():

    public void init(JobConf conf) throws IOException {
        // Read the configuration to decide whether the job runs in local
        // single-machine mode or in distributed mode
        String tracker = conf.get("mapred.job.tracker", "local");
        tasklogtimeout = conf.getInt(
            TASKLOG_PULL_TIMEOUT_KEY, DEFAULT_TASKLOG_TIMEOUT);
        this.ugi = UserGroupInformation.getCurrentUser();
        if ("local".equals(tracker)) { // local mode: create a LocalJobRunner
            conf.setNumMapTasks(1);
            this.jobSubmitClient = new LocalJobRunner(conf);
        } else {
            this.jobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf);
        }
    }

In distributed mode, an RPC proxy connection is created:

    public static VersionedProtocol getProxy(
            Class<? extends VersionedProtocol> protocol,
            long clientVersion, InetSocketAddress addr, UserGroupInformation ticket,
            Configuration conf, SocketFactory factory, int rpcTimeout) throws IOException {

        if (UserGroupInformation.isSecurityEnabled()) {
            SaslRpcServer.init(conf);
        }
        VersionedProtocol proxy =
            (VersionedProtocol) Proxy.newProxyInstance(
                protocol.getClassLoader(), new Class[] { protocol },
                new Invoker(protocol, addr, ticket, conf, factory, rpcTimeout));
        long serverVersion = proxy.getProtocolVersion(protocol.getName(),
            clientVersion);
        if (serverVersion == clientVersion) {
            return proxy;
        } else {
            throw new VersionMismatch(protocol.getName(), clientVersion,
                serverVersion);
        }
    }
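The getProxy() method builds on java.lang.reflect.Proxy, and the pattern is easier to see in isolation. The following self-contained sketch uses a hypothetical EchoProtocol interface and answers calls locally inside the InvocationHandler; a real RPC client such as Hadoop's Invoker would instead serialize the call and send it to the server. The version check mirrors the idea of comparing client and server versions, with IllegalStateException standing in for VersionMismatch.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

final class ProxySketch {
    // Hypothetical protocol interface, standing in for a VersionedProtocol subtype.
    interface EchoProtocol {
        long getProtocolVersion();
        String echo(String message);
    }

    // Creates a dynamic proxy for EchoProtocol and rejects it when the "server"
    // version differs from the client version.
    static EchoProtocol getProxy(long clientVersion, long serverVersion) {
        // Every call on the proxy lands here; this is where a real RPC layer
        // would talk to the remote server.
        InvocationHandler handler = (self, method, args) -> {
            if (method.getName().equals("getProtocolVersion")) {
                return serverVersion;
            }
            if (method.getName().equals("echo")) {
                return "server got: " + args[0];
            }
            throw new UnsupportedOperationException(method.getName());
        };

        EchoProtocol client = (EchoProtocol) Proxy.newProxyInstance(
                EchoProtocol.class.getClassLoader(),
                new Class<?>[] { EchoProtocol.class },
                handler);

        long reported = client.getProtocolVersion();
        if (reported != clientVersion) {
            throw new IllegalStateException(
                    "version mismatch: client " + clientVersion + ", server " + reported);
        }
        return client;
    }

    public static void main(String[] args) {
        EchoProtocol client = getProxy(3L, 3L);
        System.out.println(client.echo("hello"));
    }
}
```

The caller only ever sees the EchoProtocol interface, which is why client code can treat a remote service like a local object.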

As the getProxy() code shows, Hadoop implements its Remote Procedure Call layer on top of Java's built-in Proxy API. Once initialization is complete, the job has to be submitted:

    info = jobClient.submitJobInternal(conf); // submit the job information

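This submit method does the following:

1. Replaces the directory names in conf with the names used by the HDFS proxy.

2. Checks that the output is valid, for example whether the path already exists and whether it is explicitly specified.

3. Divides the data into splits, puts them on HDFS, and writes the job.xml file.

4. Calls the JobTracker's submitJob method.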

The submitJob method mainly creates a new JobInProgress object, then checks whether the access permissions and system parameters satisfy the job, and finally calls addJob:

    private synchronized JobStatus addJob(JobID jobId, JobInProgress job)
            throws IOException {
        totalSubmissions++;

        synchronized (jobs) {
            synchronized (taskScheduler) {
                jobs.put(job.getProfile().getJobID(), job);
                for (JobInProgressListener listener : jobInProgressListeners) {
                    listener.jobAdded(job);
                }
            }
        }
        myInstrumentation.submitJob(job.getJobConf(), jobId);
        job.getQueueMetrics().submitJob(job.getJobConf(), jobId);

        LOG.info("Job " + jobId + " added successfully for user '"
            + job.getJobConf().getUser() + "' to queue '"
            + job.getJobConf().getQueueName() + "'");
        AuditLogger.logSuccess(job.getUser(),
            Operation.SUBMIT_JOB.name(), jobId.toString());
        return job.getStatus();
    }

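totalSubmissions records the number of jobs that clients have submitted to the JobTracker, while jobs is the map of all the jobs that the JobTracker manages:

    Map<JobID, JobInProgress> jobs =
        Collections.synchronizedMap(new TreeMap<JobID, JobInProgress>());

taskScheduler implements the policy that decides the order in which jobs are executed.

[Figure: TaskScheduler class diagram]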

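Hadoop job scheduling mechanisms:

    public enum SchedulingMode { FAIR, FIFO }

1. Fair scheduling (FairScheduler): distributed resources are shared fairly among users, and every user has a job pool. If one user currently holds so many resources that it is unfair to the others, the scheduler kills some of that user's tasks to free resources for other users.

2. Capacity scheduling (CapacityTaskScheduler; JobQueueTaskScheduler, by contrast, is the default FIFO scheduler in Hadoop 1.x): the system maintains multiple queues, each with a certain capacity, and the jobs in each queue are scheduled FIFO. Queues can contain other queues.

Each scheduler must implement TaskScheduler's public synchronized List<Task> assignTasks(TaskTracker tracker) method, which computes the tasks that can be assigned.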
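The difference between the FIFO and FAIR modes can be sketched with a toy job picker. Everything here is hypothetical and heavily simplified: the Job class, the pickNext method, and the fairness rule (serve the user who currently runs the fewest tasks) are assumptions for illustration. The real schedulers work with pools, queues, capacities, and preemption inside assignTasks().

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SchedulingSketch {
    enum SchedulingMode { FAIR, FIFO }

    // Hypothetical job record: who submitted it, when, and how many tasks it runs now.
    static final class Job {
        final String id;
        final String user;
        final long submitTime;
        final int runningTasks;

        Job(String id, String user, long submitTime, int runningTasks) {
            this.id = id;
            this.user = user;
            this.submitTime = submitTime;
            this.runningTasks = runningTasks;
        }
    }

    // FIFO: the earliest submitted job wins.
    // FAIR (simplified): the job of the least-loaded user wins; ties fall back to FIFO.
    static Job pickNext(List<Job> jobs, SchedulingMode mode) {
        Comparator<Job> bySubmitTime = Comparator.comparingLong((Job j) -> j.submitTime);
        if (mode == SchedulingMode.FIFO) {
            return jobs.stream().min(bySubmitTime).orElse(null);
        }
        Map<String, Integer> runningPerUser = new HashMap<>();
        for (Job job : jobs) {
            runningPerUser.merge(job.user, job.runningTasks, Integer::sum);
        }
        Comparator<Job> byUserLoad =
                Comparator.comparingInt((Job j) -> runningPerUser.get(j.user));
        return jobs.stream().min(byUserLoad.thenComparing(bySubmitTime)).orElse(null);
    }

    public static void main(String[] args) {
        List<Job> jobs = Arrays.asList(
                new Job("job-1", "alice", 1L, 10),
                new Job("job-2", "alice", 2L, 5),
                new Job("job-3", "bob", 3L, 0));
        System.out.println("FIFO picks " + pickNext(jobs, SchedulingMode.FIFO).id);
        System.out.println("FAIR picks " + pickNext(jobs, SchedulingMode.FAIR).id);
    }
}
```

With this data FIFO serves alice's oldest job, while the simplified fair rule serves bob, who holds no resources yet.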

Next, let's look at the JobTracker's own work. First it records and updates the number of times the JobTracker has been restarted, retrying until the update succeeds:

    while (true) {
        try {
            recoveryManager.updateRestartCount();
            break;
        } catch (IOException ioe) {
            LOG.warn("Failed to initialize recovery manager.", ioe);
            // wait for some time
            Thread.sleep(FS_ACCESS_RETRY_PERIOD);
            LOG.warn("Retrying...");
        }
    }
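The retry loop above is a general pattern: attempt an operation, and on an IOException wait and try again. A reusable sketch of the same idea follows. The RetrySketch class and its parameters are hypothetical, and unlike the JobTracker loop, which retries forever, this version gives up after maxAttempts so that a permanent failure surfaces to the caller.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class RetrySketch {
    // Calls the action until it succeeds. After an IOException it sleeps for
    // sleepMillis and tries again; the last failure is rethrown once
    // maxAttempts has been reached.
    static <T> T retry(Callable<T> action, int maxAttempts, long sleepMillis)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (IOException ioe) {
                if (attempt >= maxAttempts) {
                    throw ioe;
                }
                Thread.sleep(sleepMillis); // wait for some time before retrying
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = { 0 };
        // Simulated flaky operation: fails twice, then succeeds.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("file system not ready");
            }
            return "restart count updated";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```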

Start the job scheduler with taskScheduler.start(). The scheduler here is FairScheduler (stock Hadoop 1.x uses JobQueueTaskScheduler unless mapred.jobtracker.taskScheduler is set). Starting it mainly initializes management objects such as the job pool manager:

    // Initialize other pieces of the scheduler
    jobInitializer = new JobInitializer(conf, taskTrackerManager);
    taskTrackerManager.addJobInProgressListener(jobListener);
    poolMgr = new PoolManager(this);
    poolMgr.initialize();
    loadMgr = (LoadManager) ReflectionUtils.newInstance(
        conf.getClass("mapred.fairscheduler.loadmanager",
            CapBasedLoadManager.class, LoadManager.class), conf);
    loadMgr.setTaskTrackerManager(taskTrackerManager);
    loadMgr.setEventLog(eventLog);
    loadMgr.start();
    taskSelector = (TaskSelector) ReflectionUtils.newInstance(
        conf.getClass("mapred.fairscheduler.taskselector",
            DefaultTaskSelector.class, TaskSelector.class), conf);
    taskSelector.setTaskTrackerManager(taskTrackerManager);
    taskSelector.start();
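The conf.getClass(...) plus ReflectionUtils.newInstance(...) combination is a plug-in mechanism: a configuration key names the implementation class, and a default is used when the key is absent. The sketch below reproduces that idea with plain JDK reflection and java.util.Properties. The LoadManager interface and both implementations are hypothetical miniatures written for this example, not Hadoop's classes.

```java
import java.util.Properties;

final class PluginSketch {
    // Hypothetical miniature of a pluggable component.
    interface LoadManager {
        boolean canAssign(int runningTasks);
    }

    static final class CapBasedManager implements LoadManager {
        public boolean canAssign(int runningTasks) {
            return runningTasks < 2; // made-up cap of two tasks
        }
    }

    static final class UnlimitedManager implements LoadManager {
        public boolean canAssign(int runningTasks) {
            return true;
        }
    }

    // Looks up a class name under key, falls back to defaultClass when the key
    // is missing, verifies the class implements iface, and instantiates it
    // through its no-argument constructor.
    static <T> T newInstance(Properties conf, String key,
                             Class<? extends T> defaultClass, Class<T> iface)
            throws ReflectiveOperationException {
        String className = conf.getProperty(key);
        Class<?> chosen = (className == null) ? defaultClass : Class.forName(className);
        return chosen.asSubclass(iface).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        String key = "mapred.fairscheduler.loadmanager";
        Properties conf = new Properties();

        LoadManager byDefault =
                newInstance(conf, key, CapBasedManager.class, LoadManager.class);
        conf.setProperty(key, UnlimitedManager.class.getName());
        LoadManager configured =
                newInstance(conf, key, CapBasedManager.class, LoadManager.class);

        System.out.println("default allows 5 tasks: " + byDefault.canAssign(5));
        System.out.println("configured allows 5 tasks: " + configured.canAssign(5));
    }
}
```

Swapping the implementation then requires only a configuration change, which is how the load manager and task selector above can be replaced without touching the scheduler.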

JobInitializer has a fixed-size ExecutorService threadPool, and each thread is used to initialize a job:

    try {
        JobStatus prevStatus = (JobStatus) job.getStatus().clone();
        LOG.info("Initializing " + job.getJobID());
        job.initTasks();
        // Inform the listeners if the job state has changed
        // Note : that the job will be in PREP state.
        JobStatus newStatus = (JobStatus) job.getStatus().clone();
        if (prevStatus.getRunState() != newStatus.getRunState()) {
            JobStatusChangeEvent event =
                new JobStatusChangeEvent(job, EventType.RUN_STATE_CHANGED, prevStatus,
                    newStatus);
            synchronized (JobTracker.this) {
                updateJobInProgressListeners(event);
            }
        }
    }
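The combination of a fixed-size thread pool and a "notify listeners only if the state changed" check can be sketched with the JDK alone. The Job class, its two states, and the string events below are hypothetical stand-ins; in particular, initTasks() here simply flips the state so that there is a change to report.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class JobInitSketch {
    enum State { SUBMITTED, INITIALIZED } // hypothetical states for this sketch

    static final class Job {
        final String id;
        volatile State state = State.SUBMITTED;

        Job(String id) {
            this.id = id;
        }

        // Stand-in for the real task creation work.
        void initTasks() {
            state = State.INITIALIZED;
        }
    }

    // Initializes every job on a pool with a fixed number of threads and
    // records one event per job whose state changed during initialization.
    static List<String> initializeAll(List<Job> jobs, int threads)
            throws InterruptedException {
        List<String> events = new CopyOnWriteArrayList<>(); // safe for concurrent adds
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Job job : jobs) {
            pool.execute(() -> {
                State previous = job.state;
                job.initTasks();
                if (previous != job.state) {
                    events.add(job.id + ": " + previous + " -> " + job.state);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return events;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Job> jobs = Arrays.asList(new Job("job-1"), new Job("job-2"), new Job("job-3"));
        for (String event : initializeAll(jobs, 2)) {
            System.out.println(event);
        }
    }
}
```

Because the jobs are initialized concurrently, the order of the recorded events can differ from run to run.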

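The initialization mainly generates the tasks and then notifies the other listeners so that they can carry out their own work. initTasks mainly does the following:

    // Record the information about the running job submitted by the user
    try {
        userUGI.doAs(new PrivilegedExceptionAction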
