Big Data Analytics Tools

Questions and Answers

Which factor is most important when choosing between programming languages, statistical solutions, and visualization tools for big data analytics?

  • Your background in programming and statistical knowledge. (correct)
  • The popularity of the tool in the industry.
  • The availability of online tutorials and documentation.
  • The cost of the software licenses.

In the context of Python, what is the primary role of the NumPy library?

  • Facilitating web-based interactive notebooks.
  • Enabling integration with other programming languages like Java and C++.
  • Providing high-level data visualization capabilities.
  • Serving as the fundamental package for numerical computation and array manipulation. (correct)

What advantage does Cython provide in the context of data science and Python programming?

  • Improved tools for statistical analysis and modeling.
  • Better integration with cloud-based services and platforms.
  • The ability to use Python syntax while gaining execution performance by compiling to C and incorporating C code. (correct)
  • Enhanced support for web application development.

Why is R considered a versatile language for statistics and data science?

  • It offers a wide array of tools for statistical analysis, machine learning, and graphical techniques. (correct)

What is the primary function of SAS Text Miner?

  • To facilitate the text mining process, aiding in prediction with a set of tools. (correct)

How does MS SQL Server facilitate the integration of relational and non-relational data management?

  • By providing a Sqoop-based connector for Apache Hadoop. (correct)

Why might data scientists opt to use lower-level statistics or programming languages despite the availability of high-level platforms?

  • To handle tasks requiring high concurrency and CPU-bound threads, which may be slower in high-level languages. (correct)

Which consideration differentiates statistical solutions like SPSS, STATA, and Minitab?

  • Features that affect user choice. (correct)

What is a notable capability of Tableau regarding R integration?

  • It enables Tableau users to access and utilize the data analysis libraries in R without needing to learn R in detail. (correct)

What makes D3 (Data-Driven Documents) a suitable choice for web visualization?

  • It exposes the underlying scenegraph instead of hiding it behind a toolkit-specific abstraction, enabling direct DOM manipulation. (correct)

Flashcards

Big Data

Big data is a large pool of data that can be captured, communicated, aggregated, stored, and analyzed.

Big data analytics review

Systematic review of programming languages, statistical tools, analytical solutions and visualization applications available in the big data analytics area.

Big Data Cloud Platforms

Cloud platforms built on open-source or company-owned frameworks to handle big data challenges.

Python

A popular high-level, interactive data analysis language with an ecosystem of scientific libraries.

Ndarray

An efficient, fast multidimensional array object provided by NumPy in Python.

Matplotlib

A Python library for visualizing data and drawing expressive plots.

R Language

Extremely versatile open source programming language for statistics and data science.

SAS software

Software for accessing, transforming, and reporting data through a flexible, extensible, web-based interface.

SAS Model Manager

Arranges and organizes the steps of building analytical model collections.

MS SQL Server

A widely used solution for traditional relational databases.

Study Notes

  • Big data is a large pool of data that can be captured, communicated, aggregated, stored, and analyzed.
  • Big data offers data scientists the opportunity to innovate and practice algorithms for analyzing unstructured data.
  • Data scientists require specific knowledge and powerful languages/tools to fully appreciate and carry out tasks using big data.
  • This paper systematically reviews programming languages, statistical tools, analytical solutions, and visualization applications for big data analytics.

Big Data Analytics

  • To work in big data analytics, data scientists should choose tools that let them carry out analysis tasks effectively.
  • Analytical tools for data fall into three categories: programming languages, statistical solutions, and visualization tools.
  • Choosing a tool depends on one's background in programming and statistics.
  • The R language requires a strong background in scientific programming and statistics.
  • Visualization tools allow the user to interact with data without prior programming or statistics knowledge.

Big Data Cloud Platforms

  • Frameworks for handling big data challenges include open-source (Apache Hadoop, SciDB) and proprietary (Google, IBM, Amazon, Microsoft).
  • Platforms are built on the cloud based on the features of these frameworks.
  • Platforms include Google App Engine, Microsoft Azure, and Amazon EC2, each with its own way of handling big data problems.
  • Big data challenges include data storage, analytics, database issues, and machine learning implementation.

Programming Languages

  • Numerous languages cater to various tasks in big data analytics; they are divided into high-level and low-level languages based on analytical usage.
  • High-level analytical languages vary in learning time and expertise required.

Python

  • Python is a favored data analysis language, used by data scientists for research.
  • Python's high-level nature and scientific libraries make it suitable for developing analytical algorithms.
  • Its scientific ecosystem is useful for applications in both industry and academia.
  • NumPy ("Numerical Python") is the fundamental package of the ecosystem and provides its base data structure.
  • Input data for Python's scientific libraries is typically represented as a NumPy array.
  • NumPy provides:
    • An efficient, fast multidimensional array object (ndarray).
    • Functions for array manipulation.
    • Tools for reading/writing array-based data sets to disk.
    • Fourier transform, linear algebra operations, and random number generation.
    • Tools to integrate other language code (C, C++, FORTRAN).
  • pandas speeds up work on structured data by offering ready-made functions and rich data structures.
  • Matplotlib is popular for data visualization and drawing expressive plots, especially 2D plots.
  • IPython is an interactive computing/development environment that maximizes productivity.
  • SciPy is a collection of packages containing efficient algorithms for:
    • Linear algebra.
    • Sparse matrix representation.
    • Special functions.
    • Basic statistical functions.
  • SciPy includes bindings for Fortran-based numerical packages like LAPACK.
  • Cython lets data scientists keep Python syntax and high-level operations while improving execution performance by compiling Python code to C and combining it with C code.
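
The NumPy capabilities listed above can be sketched in a few lines. This is a minimal illustration, not part of the original lesson: the function name `numpy_tour` and the sample values are assumptions chosen for the example, and an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np


def numpy_tour(seed=0):
    """Touch each NumPy capability named in the notes (hypothetical helper)."""
    # ndarray: a 2x3 multidimensional array of floats
    a = np.arange(6, dtype=float).reshape(2, 3)

    # Array manipulation: element-wise arithmetic followed by a transpose
    doubled_t = (a * 2).T

    # Reading/writing array data: round-trip through an in-memory buffer
    # (a real file path would work the same way with np.save / np.load)
    buf = io.BytesIO()
    np.save(buf, a)
    buf.seek(0)
    restored = np.load(buf)

    # Linear algebra: solve 2x + y = 3 and x + 3y = 5
    m = np.array([[2.0, 1.0], [1.0, 3.0]])
    solution = np.linalg.solve(m, np.array([3.0, 5.0]))

    # Fourier transform: a constant signal has only a zero-frequency term
    spectrum = np.fft.fft(np.ones(4))

    # Random number generation with a seeded generator for reproducibility
    rng = np.random.default_rng(seed)
    samples = rng.normal(size=1000)

    return {
        "doubled_t": doubled_t,
        "restored": restored,
        "solution": solution,
        "spectrum": spectrum,
        "samples": samples,
    }
```

Calling `numpy_tour()` returns a dictionary with one entry per capability, so each result can be inspected separately in an interactive session such as IPython.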

R Language

  • R is a versatile open-source programming language for statistics and data science.
  • Statistical computation in R can be done using functional-based syntax or program-based code.
  • R features:
    • Short, simple syntax.
    • Various formats for loading and storing data (locally or over the internet).
    • In-memory task execution with consistent syntax.
    • Extensive collection of tools (functions and packages) for data analysis.
    • Easy graphical representation of statistical results, with the option to save graphs.
    • Ability to automate analyses, create new functions, and extend existing features.
    • Persistence of data and command history between sessions.
  • GUI options for R include RStudio, ESS, R Commander, JGR (Java GUI for R), and StatET.
  • R is available for Windows, Macintosh, and Linux.
  • R provides tools for statistical analysis, machine learning, time-series analysis, classification, clustering, and graphical techniques, and is extensible.
  • Various built-in/extended functions in R support statistical, machine learning, and visualization tasks, such as:
    • Data extraction.
    • Data cleaning.
    • Data loading.
    • Data transformation.
    • Statistical analysis.
    • Data visualization.
    • Predictive modeling.
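
The task sequence above is not specific to R. As a rough sketch, here is the same flow in Python (used to keep the examples in these notes in one language), relying only on the standard library. The data in `RAW_CSV` and the function name `run_pipeline` are made up for illustration, the visualization step is omitted, and the predictive model is a plain least-squares line.

```python
import csv
import io
import statistics

# Hypothetical sample data; the empty y value stands in for a missing measurement.
RAW_CSV = """x,y
1,2.1
2,3.9
3,
4,8.2
5,9.8
"""


def run_pipeline(raw_text):
    """Extract, clean, transform, summarize, and fit a line to two-column data."""
    # Extraction and loading: parse CSV text into dictionaries
    rows = list(csv.DictReader(io.StringIO(raw_text)))

    # Cleaning: drop rows with a missing field
    clean = [r for r in rows if r["x"].strip() and r["y"].strip()]

    # Transformation: convert text fields to numbers
    xs = [float(r["x"]) for r in clean]
    ys = [float(r["y"]) for r in clean]

    # Statistical analysis: basic descriptive statistics
    summary = {
        "n": len(ys),
        "mean_y": statistics.mean(ys),
        "stdev_y": statistics.stdev(ys),
    }

    # Predictive modeling: ordinary least squares for y = slope * x + intercept
    mean_x = statistics.mean(xs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - summary["mean_y"]) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = summary["mean_y"] - slope * mean_x
    return summary, slope, intercept
```

With the sample data, `run_pipeline(RAW_CSV)` drops the incomplete third row and fits the line to the remaining four points.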

SAS (Software)

  • SAS software, with its language, is a common solution for accessing, transforming, and reporting data via a flexible, extensible, web-based interface.
  • The SAS Analytics Platform consists of analytical applications that form the application framework.
  • SAS Text Miner can be added to SAS Enterprise Miner to facilitate text mining for prediction, supporting the process with a set of tools.
  • SAS Forecast Server's automation and scalability support future-oriented decisions: it generates large quantities of high-quality, automated forecasts, making forecasting more efficient across a range of planning challenges.
  • SAS Model Manager organizes the steps of building analytical model collections, from creating and managing to monitoring and administering them, and gives decision makers web-based tools and support for lifecycle management and governance.

MS SQL Server

  • MS SQL Server is a solution for traditional relational databases, and has graphical tools for ERDs.
  • MS SQL Server now also covers business intelligence (BI) concepts through distributed features.
  • MS SQL Server 2012 includes integrated Analysis Services, which relates to the Tabular Model, the Multidimensional Model, and the Microsoft BI stack.
  • There are three main BI services in MS SQL Server 2012:
    • SQL Server Integration Services (SSIS) for data collection.
    • SQL Server Analysis Services (SSAS) for data analysis.
    • SQL Server Reporting Services (SSRS) for data viewing (visualization).
  • Microsoft built a Sqoop-based connector, the SQL Server-Hadoop connector, to provide an efficient connection between SQL Server and Hadoop.
  • MS SQL Server 2012 provides tools for data integration, visualization, a Business Intelligence suite, and efficient connection to Apache Hadoop and Hive.
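
The division of labor among the three BI services (collect, analyze, view) can be mimicked on a small scale. The sketch below is only an analogy: it uses Python's built-in `sqlite3` module as a stand-in relational engine and does not call any SQL Server, SSIS, SSAS, or SSRS API. The table layout, sample rows, and the function name `bi_stages` are assumptions made for the example.

```python
import sqlite3


def bi_stages(sales):
    """Run collect -> analyze -> report over (region, amount) pairs."""
    conn = sqlite3.connect(":memory:")

    # Stage 1, collection (the role SSIS plays): load rows into a table
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)

    # Stage 2, analysis (the role SSAS plays): aggregate per region
    totals = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()

    # Stage 3, viewing (the role SSRS plays): format a plain-text report
    report = "\n".join(f"{region}: {total:.2f}" for region, total in totals)
    return totals, report
```

For example, `bi_stages([("north", 10.0), ("south", 5.5), ("north", 2.5)])` returns the per-region totals together with a two-line text report.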

Statistical Solutions

  • Statistical solutions include SPSS, STATA, Minitab, and Statistica.
  • STATA is strong, but not user-friendly for non-statistical users.

SPSS (Statistical Package for the Social Sciences)

  • SPSS enables data analysts to base decisions on predictive results that indicate what is likely to happen next.
  • The software is now known as "IBM SPSS Statistics"; the current version includes all features and add-ons.

Visualization Tools

  • Visualization tools include Python, R, SAS, Julia, MATLAB, Tableau, QlikView, Spotfire, Cognos, D3, Protovis, etc.
  • Tableau is for general visualization, and D3 is for web visualization.
  • Tableau produces polished visualizations: simple plots that are easy to compare can be shown in a single frame, with filtering and customization giving the user full control.
  • Starting with Tableau 8.1, R integration offers the following capabilities:
    • Users gain access to R's wide range of data analysis libraries.
    • Tableau users can include R commands in any of Tableau's four function options and use the results to build further graphs in Tableau.
    • R users can explore data in Tableau by applying their R code to it.
  • D3 (Data-Driven Documents) is intended for web visualization; it enables DOM manipulation and comes with developer tools.

Analysis and Comparison of Analytical Tools

  • STATA is the best statistical software, though it requires good knowledge of statistical concepts. Most new users will prefer SPSS to other statistical software.
  • For professional visualization, Tableau is the best for graphs, charts, maps, etc.
  • Benchmarks comparing D3 with Protovis and Flash show D3 to be about twice as fast as Protovis and more than three times as fast as Flash, confirming that D3 is the better choice.

Analytical Tools Usage

  • A KDnuggets poll (2014) found that R, SAS, Python, and SQL are the four main languages for analytics, data mining, and data science.

Python vs. R

  • Python and R are similar programming languages, with some differences.

Conclusion

  • R language is a common programming language for data scientists.
  • SPSS is a good statistical tool for non-statisticians.
  • Tableau Public is well suited to presenting and analyzing data graphically.
  • D3 is the best choice for web visualization purposes.
