30  Data Understanding

In data science, much of the work goes into sourcing relevant data and preparing it for predictive analysis. This matters most in supervised learning, where the model’s success hinges on a labeled dataset that accurately represents the problem. For instance, to predict restaurant wait times, we would need a dataset with one entry per order, labeled with the actual wait time the customer experienced. Sometimes existing data collection methods must be enhanced, for example by implementing a system that tracks the time from order placement to food delivery.
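As a minimal sketch of what such a tracking system might produce, the snippet below derives a wait-time label from two timestamps per order. The record layout, the field names (`placed_at`, `delivered_at`, `wait_minutes`), and the sample values are all hypothetical, chosen only for illustration.

```python
from datetime import datetime

# Hypothetical order log: each record stores when the order was placed
# and when the food was delivered (None if not delivered yet).
orders = [
    {"order_id": 1, "placed_at": "2024-05-01 12:00:00", "delivered_at": "2024-05-01 12:18:30"},
    {"order_id": 2, "placed_at": "2024-05-01 12:05:00", "delivered_at": "2024-05-01 12:41:00"},
    {"order_id": 3, "placed_at": "2024-05-01 12:10:00", "delivered_at": None},
]

FMT = "%Y-%m-%d %H:%M:%S"


def wait_minutes(order):
    """Return the wait time in minutes, or None if the order has no delivery time."""
    if order["delivered_at"] is None:
        return None
    placed = datetime.strptime(order["placed_at"], FMT)
    delivered = datetime.strptime(order["delivered_at"], FMT)
    return (delivered - placed).total_seconds() / 60


# Attach the label to every record; unlabeled rows keep None and
# would have to be excluded before supervised training.
labeled = [{**o, "wait_minutes": wait_minutes(o)} for o in orders]

for row in labeled:
    print(row["order_id"], row["wait_minutes"])
```

Orders without a delivery timestamp stay unlabeled, which makes the amount of usable training data explicit.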

The goal of this phase is to locate, understand, and assess the quality of the data sources needed to answer the research question. For each source, you should consider its completeness, correctness, and relevance.

Poll Time

Q1

Now, suppose you are working on predicting restaurant wait times based on various factors such as order time, order size, and staffing levels. You will be provided with a few potential data sources. You will work in groups to evaluate different potential data sources based on their appropriateness for the specified research question. For each source, you will need to rate the source based on its data completeness, correctness, and relevance.

Data Sources to Evaluate:

  • Database A: Contains detailed order data and staffing levels but lacks wait times.
  • API B: Provides real-time access to order data and wait times but only for the past month.
  • Survey C: Customer feedback data including perceived wait times and order satisfaction, collected irregularly.
  • Public Dataset D: A publicly available dataset from a study conducted last year, containing order data, staffing, and actual wait times marked.
  1. Evaluate Database A for Predicting Restaurant Wait Times:
     • How do you rate its completeness? (Poor, Fair, Good, Excellent)
     • How do you rate its correctness? (Poor, Fair, Good, Excellent)
     • How do you rate its relevance? (Poor, Fair, Good, Excellent)
  2. Evaluate API B for Predicting Restaurant Wait Times:
     • How do you rate its completeness? (Poor, Fair, Good, Excellent)
     • How do you rate its correctness? (Poor, Fair, Good, Excellent)
     • How do you rate its relevance? (Poor, Fair, Good, Excellent)
  3. Evaluate Survey C for Predicting Restaurant Wait Times:
     • How do you rate its completeness? (Poor, Fair, Good, Excellent)
     • How do you rate its correctness? (Poor, Fair, Good, Excellent)
     • How do you rate its relevance? (Poor, Fair, Good, Excellent)
  4. Evaluate Public Dataset D for Predicting Restaurant Wait Times:
     • How do you rate its completeness? (Poor, Fair, Good, Excellent)
     • How do you rate its correctness? (Poor, Fair, Good, Excellent)
     • How do you rate its relevance? (Poor, Fair, Good, Excellent)

Access the live poll here: https://PollEv.com/weihongni276
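One way to make a completeness rating less subjective is to measure, for each field the research question needs, the fraction of records with a value present. The sketch below assumes a small hypothetical sample drawn from a source; the field names and values are invented, and the `completeness` helper is illustrative rather than part of any library.

```python
# Hypothetical sample of records from a candidate source.
# None marks a missing value.
records = [
    {"order_time": "12:00", "order_size": 3, "staff_on_shift": 5, "wait_minutes": None},
    {"order_time": "12:05", "order_size": None, "staff_on_shift": 5, "wait_minutes": None},
    {"order_time": "12:10", "order_size": 2, "staff_on_shift": 4, "wait_minutes": None},
    {"order_time": "12:20", "order_size": 6, "staff_on_shift": 4, "wait_minutes": None},
]

# Fields needed to predict wait times, including the label itself.
required_fields = ["order_time", "order_size", "staff_on_shift", "wait_minutes"]


def completeness(rows, fields):
    """Return, per field, the fraction of rows with a non-missing value."""
    total = len(rows)
    return {f: sum(r.get(f) is not None for r in rows) / total for f in fields}


scores = completeness(records, required_fields)
for field, score in scores.items():
    print(f"{field:15s} {score:.2f}")
```

A source that scores 0.0 on the label field, as this sample does for `wait_minutes`, cannot be used for supervised training on its own, however complete its other fields are. Correctness and relevance still require judgment about how the data was collected.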

Setting up a robust data recording system and compiling sufficient data for training demands substantial effort and time. Once the data is collected, the next crucial step is to thoroughly understand it, which will guide the choice of an appropriate model and help interpret the outputs effectively. Initial steps typically include:

Visualizing the data is equally important. Techniques might include histograms, box plots, scatter plots, and line graphs.

For example, analyzing a dataset of human heights might reveal an average of 168.3 cm. However, a histogram could show a bimodal distribution, with one cluster around 160 cm and another around 170 cm. This insight could lead to a refinement of the research question or the modeling strategy. Perhaps the model could instead be designed to classify data points as ‘female’ or ‘male’ heights based on factors such as parental heights, dietary quality, and caloric intake.

Understanding these dynamics is crucial because if our model simply predicts average heights around 168.3 cm, it ignores the true structure of the data, potentially leading to inaccurate or misleading predictions. Thus, the phase of finding and understanding data is not just about collection but about laying a solid groundwork for the predictive modeling that follows.
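The gap between a single average and the shape of the distribution can be illustrated with synthetic data. The sketch below assumes two equally sized groups centred near 160 cm and 170 cm with a 3 cm spread; these parameters are invented, so the mean it prints will not match the 168.3 cm figure above.

```python
import random
import statistics
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic heights in cm: two hypothetical groups of 500 people each.
heights = [random.gauss(160, 3) for _ in range(500)] + [
    random.gauss(170, 3) for _ in range(500)
]

# A single summary number hides the two clusters.
mean_height = statistics.mean(heights)
print(f"mean = {mean_height:.1f} cm")

# Text histogram with 2 cm bins, keyed by the lower edge of each bin.
bin_width = 2
bins = Counter(int(h // bin_width) * bin_width for h in heights)
for start in sorted(bins):
    bar = "#" * (bins[start] // 5)  # one '#' per 5 observations
    print(f"{start:3d}-{start + bin_width:3d} | {bar}")
```

The mean falls between the two clusters, in a region where the histogram shows relatively few observations, which is the pattern a model predicting only the average would miss.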

Poll Time

Q1

Imagine you are working with a dataset that includes the heights of adults recorded over the last decade. The dataset includes variables such as height, gender, age, and geographical location. Initial analysis indicates potential variations in height distribution across different groups.

Given the dataset described, which visualization technique would be most effective to initially explore the distribution of heights across different genders and identify any potential outliers?

  1. Box plots
  2. Histograms
  3. Scatter plots
  4. Line graphs

Given the dataset described, which visualization technique would be most effective to examine the relationship between height and age?

  1. Box plots
  2. Histograms
  3. Scatter plots
  4. Line graphs

Access the live poll here: https://PollEv.com/weihongni276