# American Institute of Mathematical Sciences

doi: 10.3934/fods.2021032
Online First


## Facilitating API lookup for novices learning data wrangling using thumbnail graphics

1 University of Glasgow, UK; 2 American University in Cairo, Egypt; 3 University of Helsinki, Finland

* Corresponding author: l.sundin.1@research.gla.ac.uk

Received July 2021; Revised September 2021; Early access December 2021

With the rising demand for data science skills, learning to wrangle data programmatically has become a crucial barrier. In this paper, we discuss the centrality of API (application programming interface) lookup to data wrangling, and how an ontology-structured command menu could facilitate it. We design thumbnail graphics as a visual alternative for explaining data wrangling operations and use a survey to validate their quality. We further predict that thumbnail graphics make the menu more navigable, improving lookup efficiency and performance. We test these predictions using Slice N Dice, an online data wrangling tutorial platform that collects learner activity and includes both non-programmatic and programmatic data wrangling exercises. Participants from a multi-institutional sample (n = 200) were randomly assigned to the tutorial either with or without thumbnail graphics. Our results show that thumbnail graphics reduce the need for clarifications, thereby assisting API lookup for novices learning data wrangling. We also report some negative results regarding performance gain, and discuss why the differences are subtle and how they could be improved. Finally, we complement our statistical results with a qualitative study in which participants gave positive feedback on the design and helpfulness of the thumbnail graphics.

Citation: Lovisa Sundin, Nourhan Sakr, Juho Leinonen, Quintin Cutts. Facilitating API lookup for novices learning data wrangling using thumbnail graphics. Foundations of Data Science, doi: 10.3934/fods.2021032
Figure 1. Kelleher & Ichinco's Collection and Organization of Information for Learning (COIL) model [20,15]
The platform is split into three parts. Part 1 introduces the user to an ontology of data wrangling operations. Part 2 introduces programming. Part 3 contains 18 programmatic data wrangling exercises
The sidebar menu and a Part 1 operation card under the two conditions
A snippet from the sidebar menu. Calculate is one of five top-level categories, while the next level is split by data structure (e.g. dataframes). In the thumbnail graphics (TG) condition, each leaf node has a thumbnail graphic
Three examples of thumbnail graphics, using color in different ways to convey operation semantics
Part 1 contains a series of exercises in which the user selects operations from the menu and drags them to the corresponding subgoals
Part 3 involves programming exercises. The user is guided by a list of subgoals, each of which has associated hints. The sidebar menu serves as a menu for looking up API documentation (shown above). In reality, the menu, subgoals and documentation are all tabs within the same sidebar panel
Participants were asked to rate their experience with Excel, Python, and R on a five-point Likert scale (1 = Not at all to 5 = Advanced). The distributions among participants who started and completed each part are illustrated; they provide no visual evidence that more experienced participants are more likely to persevere
The distributions in the number of tooltip events per person, grouped by condition, in Part 1 (left) and Part 3 (right). Dashed lines indicate medians. In both cases, the TG group uses the tooltip significantly less often
Total number of menu clicks per participant in Part 3, grouped by condition. Dashed lines indicate medians
Total reading times of operation cards per person, grouped by condition. Dashed lines indicate medians. The TG group is quicker on average, but the difference is non-significant
Total time on task per person for Part 1 (left) and Part 3 (right), grouped by condition. Dashed lines indicate medians. The median differences are negligible in both parts
Number of incorrect attempts per person for Part 1 (left) and 3 (right), grouped by condition. Dashed lines indicate medians. In both parts, the TG group makes fewer incorrect attempts, but the difference is not significant
Responses to the evaluation survey item asking participants how helpful they found the thumbnail graphics and tooltips (1 = Not at all, 5 = Very much) in Part 1 (N = 187) and Part 3 (N = 115). The survey was given after both Part 1 (left) and Part 3 (right)
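
The sidebar menu described in the captions above (top-level categories such as Calculate, a second level split by data structure, and leaf operations that carry a thumbnail graphic in the TG condition) can be pictured as a small tree. The following is a minimal sketch under stated assumptions, not the Slice N Dice implementation: the `MenuNode` class, the `leaves` and `find` helpers, the leaf operation names, and the thumbnail file names are all hypothetical. Only "Calculate" and "dataframes" are taken from the captions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class MenuNode:
    """One entry in a hierarchical command menu (hypothetical structure)."""
    label: str
    thumbnail: Optional[str] = None  # only leaf operations carry a thumbnail
    children: list[MenuNode] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaves(node: MenuNode, path: tuple = ()) -> Iterator[tuple]:
    """Yield (path, node) for every leaf operation below `node`."""
    here = path + (node.label,)
    if node.is_leaf():
        yield here, node
    else:
        for child in node.children:
            yield from leaves(child, here)


def find(node: MenuNode, query: str) -> list[str]:
    """Return menu paths of leaf operations whose label contains `query`."""
    q = query.lower()
    return [" > ".join(p) for p, leaf in leaves(node) if q in leaf.label.lower()]


# "Calculate" and "dataframes" come from the caption; the leaf operations
# and thumbnail file names below are invented for illustration.
menu = MenuNode("Calculate", children=[
    MenuNode("dataframes", children=[
        MenuNode("column mean", thumbnail="column_mean.svg"),
        MenuNode("row sum", thumbnail="row_sum.svg"),
    ]),
])

print(find(menu, "mean"))  # ['Calculate > dataframes > column mean']
```

Keeping the thumbnail as an optional field on the leaf means the same tree could serve both study conditions, with the control condition simply ignoring the field when rendering.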
