Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

The web page at https://data.london.gov.uk/dataset/london-atmospheric-emissions-

ID: 3742113 • Letter: T

Question

The web page at https://data.london.gov.uk/dataset/london-atmospheric-emissions-inventory-2013 describes the various types of data held for monitored atmospheric emissions for the year 2013. Note that ADSS will ultimately be working on datasets compiled during a 50-year period, but you will be focusing only on a single year (2013), in the first instance.

What are the merits and demerits of:

Keeping the data sets on the local file system where they are normally held?

Moving the data to Hadoop HDFS storage?

Explanation / Answer

The local file system is the most traditional and an easy way to store the data. The files are in independent files , which does not have an organizing structure. Accessing of data in a local file system is easy for physical access of the data. In a file system, data is stored in the form of isolated files which have their own set of properties and location on the drive so that the user manually goes to these locations to access the files.

Some of the advantages of storing in local file system are :

File Compression :The file systems can compress the files, still permitting the applications to access them. The system automatically decompresses the file when we need them and compresses it again when the file is closed.

But more than the advantages, there are so many disadvantages for the local storage system as the technology has improved. The disadvantages are :

Data Redundancy: Data redundancy is more where the same file can be duplicated and can exist with same information. Duplicate data makes it easier to locate but this causes more memory wastage.

Data Integrity : It is  hard to achieve data integrity where consistancy of data must be kept unchanged.

Data Sharing : Limited data sharing is possible here due to data isolation, it is difficult to share data among different applications. Files may be private and a user outside the application would not be able to access the files.

As the files as isolated , the files dont communicate with each other . So it is impossible to merge or compare the two files with each other if needed.

In a Hadoop HDFS storage, It implements a distributed file system that provides a high performance data access along the system. It is an efficiant platform to process the Bigdata where a huge amount of structured data is being processed. In HDFS , the data is breakdown to seperate blocks and is distributed to different nodes in a cluster , which supports parallel processing of the data.

The system is also highly Fault Tolerant. For every set of data , the data is copied and multiple set of this data is stored in different racks of different servers. Even if a server crashes, the copy of this data will be available , and can be recovered.

HDFS uses master/slave architecture with a single nameNode and a number of dataNodes. This technology helps to support applications with large data sets.

The applications can be run on thousands of nodes enabling high scalability for very large set of data across hundreds of servers which operate in parallel. This also makes Hadoop a  cost effective storage solution.

Hadoop HDFS is also Fast and Flexible. The processing is fast as 'maps' are used to locate each and every data over the clusters. Even a very large set of data can be accessed in a few seconds.

Some of the Drawbacks of Hadoop HDFS are :

Latency : Since the MapReduce is designed to support data with different format, structure and huge volume , its framework is comparatively slower and requires a large amount of time to perform the tasks thereby increasing latency.

Lack of Abstraction : HDFS does not use any type of abstraction which makes the work difficut for the developers.

Vulnerable : There are a lot of security issues and it is vulnerable in nature. As the Hadoop code is written in java , which is a widely used programming language , there is a chance for cybercriminals to exploit it , making it more vulnerable.