Data Science Fundamentals

Questions and Answers

What is data science according to the content?

  • A field that only focuses on data mining
  • A multi-disciplinary field that uses scientific methods to extract knowledge and insights from structured, semi-structured, and unstructured data (correct)
  • A field that only deals with programming
  • A field that only analyzes data

What are the essential skills required for a data scientist?

  • Only programming knowledge
  • Curiosity, result-oriented, exceptional industry-specific knowledge, communication skills, and strong quantitative background in statistics and linear algebra (correct)
  • Only strong quantitative background in statistics and linear algebra
  • Only communication skills

What is the difference between data and information?

  • Data is processed, and information is unprocessed
  • Data is unprocessed, and information is processed (correct)
  • Data is quantitative, and information is qualitative
  • Data is qualitative, and information is quantitative

What are the three steps in the data processing cycle?

  • Input, Processing, Output

What is data processing?

  • Re-structuring or re-ordering of data

What is a data type from a computer programming perspective?

  • An attribute of data that tells the compiler how the programmer intends to use the data

What is the primary purpose of a data type?

  • To define the operations that can be done on the data

Which of the following data types is used to store a single character?

  • Character

What is the main characteristic of structured data?

  • It adheres to a pre-defined data model

What type of data is typically found in audio and video files?

  • Unstructured data

What is the purpose of metadata?

  • To provide additional information about a specific set of data

What is the main difference between semi-structured data and structured data?

  • Semi-structured data contains tags or other markers, while structured data does not

What is the purpose of a data value chain?

  • To describe the information flow within a big data system

What type of data is typically found in JSON and XML files?

  • Semi-structured data

What is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage?

  • Data Acquisition

What is the primary goal of Data Curation?

  • To ensure data meets the necessary data quality requirements for its effective usage

What is the characteristic of Big Data that refers to the speed at which data is generated or processed?

  • Velocity

What is the primary focus of Data Storage?

  • To store data in a scalable way that satisfies the needs of applications

What is the term for non-traditional strategies and technologies needed to gather, organize, process, and gather insights from large datasets?

  • Big Data

What is the characteristic of Big Data that refers to the trustworthiness of data?

  • Veracity

What is the term for the process of making the raw data acquired amenable to use in decision-making as well as domain-specific usage?

  • Data Analysis

What is the primary focus of Data Usage?

  • To integrate data analysis within the business activity

What is the primary benefit of resource pooling in clustered computing?

  • Combining available storage, CPU, and memory

What is a characteristic of Hadoop?

  • Economical

What is the primary function of HDFS in the Hadoop ecosystem?

  • Data storage

What is the purpose of ingesting data into the Hadoop system?

  • To prepare the data for processing

Which of the following is a benefit of using Hadoop for big data processing?

  • Improved scalability

What is the primary function of YARN in the Hadoop ecosystem?

  • Data processing (YARN manages cluster resources and schedules the jobs that process the data)

What is the final stage of the big data life cycle with Hadoop?

  • Visualizing the results

What is the purpose of Zookeeper in the Hadoop ecosystem?

  • Data management (ZooKeeper coordinates and manages the services running in the cluster)

    Study Notes

    Data Science

    • Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
    • Data scientists need to be curious and result-oriented, with exceptional industry-specific knowledge, strong communication skills, and a strong quantitative background in statistics and linear algebra.

    Data and Information

    • Data refers to unprocessed facts and figures, represented with characters such as letters, digits, and special symbols.
    • Information is processed or interpreted data, serving as a base for decisions and actions.

    Data Processing Cycle

    • Data processing involves re-structuring or re-ordering of data.
    • The data processing cycle consists of three steps: input, processing, and output.
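
    The three steps above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample records are hypothetical, and "processing" here means parsing and re-ordering comma-separated scores.

```python
def collect_input(raw_lines):
    """Input: prepare raw data in a form convenient for processing."""
    return [line.strip() for line in raw_lines if line.strip()]


def process(records):
    """Processing: re-structure the data by parsing and sorting scores."""
    parsed = []
    for record in records:
        name, score = record.split(",")
        parsed.append((name.strip(), int(score)))
    # Re-order the records from highest to lowest score.
    return sorted(parsed, key=lambda pair: pair[1], reverse=True)


def produce_output(ranked):
    """Output: present the processed result as readable information."""
    return [
        f"{rank}. {name}: {score}"
        for rank, (name, score) in enumerate(ranked, start=1)
    ]


# Hypothetical raw data: unordered, with an empty line mixed in.
raw = ["abel, 72", "", "sara, 91", "kebede, 85"]
report = produce_output(process(collect_input(raw)))
print("\n".join(report))
```

    The raw lines are data; the ranked report is information that can support a decision.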

    Data Types and Representation

    • From a computer programming perspective, a data type is an attribute of data that tells the compiler or interpreter how the programmer intends to use it. Common data types include integers, booleans, characters, floating-point numbers, and alphanumeric strings. A data type defines the operations that can be performed on the data, the meaning of the data, and how values of that type can be stored.
    • From a data analytics perspective, there are three common types of data:
      • Structured data: adheres to a pre-defined data model, conforms to a tabular format with rows and columns, and is therefore straightforward to analyze.
      • Semi-structured data: does not conform to a formal tabular data model but contains tags or other markers that separate elements and describe their structure, which is why it is also called self-describing. JSON and XML are typical examples.
      • Unstructured data: has no pre-defined data model and is not organized in a pre-defined manner. It is typically text-heavy, and audio and video files also fall into this category.
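
    As a rough illustration of the three analytics data types, the sketch below handles one tiny sample of each using only the Python standard library. The sample values are invented for the example.

```python
import csv
import io
import json

# Structured: rows and columns that follow a fixed, pre-defined schema.
csv_text = "id,name,age\n1,Abel,30\n2,Sara,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
ages = [int(row["age"]) for row in rows]

# Semi-structured: keys act as tags, and records may differ in shape.
json_text = '[{"name": "Abel", "phones": ["0911"]}, {"name": "Sara"}]'
people = json.loads(json_text)
phone_counts = {p["name"]: len(p.get("phones", [])) for p in people}

# Unstructured: free text with no data model; structure must be inferred.
note = "Met Sara on Monday. She mentioned the report is due Friday."
word_count = len(note.split())

print(ages, phone_counts, word_count)
```

    The structured sample can be queried by column name directly, the semi-structured sample needs a default for the missing "phones" key, and the unstructured note supports only generic operations such as counting words until further analysis is applied.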

    Metadata

    • Metadata is data about data, providing additional information about a specific set of data.
    • For example, the metadata of a photograph includes fields that record the date and the location at which it was taken.
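
    The photograph example can be sketched as follows. The class names, field names, and coordinates are hypothetical; the point is only that the metadata is kept alongside the data and can be queried without touching it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PhotoMetadata:
    """Data about a photo, stored separately from the image pixels."""
    taken_at: datetime
    latitude: float
    longitude: float


@dataclass
class Photo:
    pixels: bytes            # the data itself
    metadata: PhotoMetadata  # data about the data


photos = [
    Photo(b"\x00\x01", PhotoMetadata(datetime(2023, 5, 1, 9, 30), 9.03, 38.74)),
    Photo(b"\x02\x03", PhotoMetadata(datetime(2024, 1, 15, 18, 0), 11.59, 37.39)),
]

# Metadata lets us filter photos without inspecting the pixel data.
taken_in_2024 = [p for p in photos if p.metadata.taken_at.year == 2024]
print(len(taken_in_2024))
```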

    Data Value Chain

    • The data value chain describes the information flow within a big data system, consisting of:
      • Data Acquisition: gathering, filtering, and cleaning data before it is put in a data warehouse or storage.
      • Data Analysis: making the raw data acquired amenable to use in decision-making and domain-specific usage.
      • Data Curation: actively managing data over its life cycle to ensure it meets the necessary data quality requirements.
      • Data Storage: persisting and managing data in a scalable way that satisfies the needs of applications that require fast access to the data.
      • Data Usage: covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
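
    One way to picture the chain is as a sequence of functions, each feeding the next. The sketch below is a toy under stated assumptions: the sensor readings, the valid range used for curation, and the in-memory list standing in for a data warehouse are all hypothetical.

```python
def acquire(raw_readings):
    """Acquisition: gather, filter, and clean data before storage."""
    return [r.strip() for r in raw_readings if r and r.strip()]


def analyze(cleaned):
    """Analysis: make raw data amenable to decision-making (parse numbers)."""
    return [float(value) for value in cleaned]


def curate(values, low=-50.0, high=60.0):
    """Curation: enforce a quality requirement (a hypothetical valid range)."""
    return [v for v in values if low <= v <= high]


def store(values, warehouse):
    """Storage: persist data; a list stands in for a scalable data store."""
    warehouse.extend(values)
    return warehouse


def use(warehouse):
    """Usage: a business activity consumes the analysis (average reading)."""
    return sum(warehouse) / len(warehouse)


warehouse = []
raw = [" 21.5", "", "22.5 ", "999", "20.0"]
average = use(store(curate(analyze(acquire(raw))), warehouse))
print(average)
```

    The empty reading is dropped during acquisition and the out-of-range value 999 is rejected during curation, so only three readings reach storage and usage.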

    Big Data

    • Big data refers to large and complex datasets that exceed the computing power or storage of a single computer.
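    • Big data is commonly characterized by three Vs, with a fourth often added:
      • Volume: the large amount of data.
      • Velocity: the speed at which data is generated or processed, as with live streaming data in motion.
      • Variety: data arrives in many different forms from diverse sources.
      • Veracity: the trustworthiness and accuracy of the data.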

    Clustered Computing and Hadoop Ecosystem

    • Clustered computing combines the resources of many smaller machines, providing benefits such as resource pooling, high availability, easy scalability, and fault tolerance.
    • Hadoop is an open-source framework for distributed processing of large datasets across clusters of computers, characterized by being economical, reliable, scalable, and flexible.
    • The Hadoop ecosystem groups its tools into four core areas: data management, data access, data processing, and data storage. It continues to grow to meet the needs of big data.
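
    The idea of splitting work across many machines and combining the partial results can be imitated on a single computer. The sketch below is not Hadoop code and uses no Hadoop API: threads from the Python standard library stand in for cluster nodes, and the function names and text chunks are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Work done by one hypothetical node on its share of the data."""
    return Counter(chunk.split())


def cluster_word_count(chunks, workers=3):
    """Distribute chunks across workers, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(count_words, chunks)
        total = Counter()
        for partial in partial_counts:
            total.update(partial)
    return total


# Each string plays the role of a data block held by a different machine.
chunks = ["big data big", "data cluster", "big cluster cluster"]
totals = cluster_word_count(chunks)
print(totals["big"], totals["data"], totals["cluster"])
```

    Adding more chunks and raising `workers` mirrors the scalability benefit described above: the per-chunk work stays the same, and only the number of workers changes.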

    Big Data Life Cycle with Hadoop

    • The big data life cycle with Hadoop involves:
      • Ingesting data into the system from various sources.
      • Processing the data in storage.
      • Computing and analyzing data using processing frameworks.
      • Visualizing the results for user access.


    Related Documents

    Emerging Tech Chap 2.pdf

    Description

    Learn about the basics of data science, including its definition, key skills, and concepts related to data and information.
