With the rising demand for data science skills, the ability to wrangle data programmatically has become a crucial barrier to entry. In this paper, we discuss the centrality of API (application programming interface) lookup to data wrangling and how an ontology-structured command menu could facilitate it. We design thumbnail graphics as visual alternatives for explaining data wrangling operations and use a survey to validate their quality. We furthermore predict that thumbnail graphics make the menu more navigable, improving lookup efficiency and performance. We test these predictions using Slice N Dice, an online data wrangling tutorial platform that collects learner activity and includes both non-programmatic and programmatic data wrangling exercises. Participants from a multi-institutional sample (n = 200) were randomly assigned the tutorial either with or without thumbnail graphics. Our results show that thumbnail graphics reduce the need for clarifications, thereby assisting API lookup for novices learning data wrangling. We also present some negative results regarding performance gain and follow up with a discussion of why the differences are subtle and how they can be improved. Finally, we complement our statistical results with a qualitative study in which participants gave positive feedback on the design and helpfulness of the thumbnail graphics.
[1] V. Aleksić and M. Ivanović, Introductory programming subject in European higher education, Informatics in Education, 15 (2016), 163–182. doi: 10.15388/infedu.2016.09.
[2] A. C. Bart, J. Tibau, E. Tilevich, C. A. Shaffer and D. Kafura, BlockPy: An open access data-science environment for introductory programmers, Computer, 50 (2017), 18–26. doi: 10.1109/MC.2017.132.
[3] B. Baumer, A data science course for undergraduates: Thinking with data, The American Statistician, 69 (2015), 334–342. doi: 10.1080/00031305.2015.1081105.
[4] Y. Ben-David Kolikant and Z. Ma'ayan, Computer science students' use of the internet for academic purposes: Difficulties and learning processes, Computer Science Education, 28 (2018), 211–231. doi: 10.1080/08993408.2018.1528045.
[5] J. Brandt, P. J. Guo, J. Lewenstein, M. Dontcheva and S. R. Klemmer, Two studies of opportunistic programming: Interleaving web foraging, learning, and writing code, in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2009, 1589–1598. doi: 10.1145/1518701.1518944.
[6] J. E. Broatch, S. Dietrich and D. Goelman, Introducing data science techniques by connecting database concepts and dplyr, Journal of Statistics Education, 27 (2019), 147–153. doi: 10.1080/10691898.2019.1647768.
[7] M. Cembalo, A. De Santis and U. Ferraro Petrillo, SAVI: A new system for advanced SQL visualization, in Proceedings of the 2011 Conference on Information Technology Education, 2011, 165–170. doi: 10.1145/2047594.2047641.
[8] CrowdFlower, Data Science Report 2016, http://www2.cs.uh.edu/~ceick/UDM/CFDS16.pdf, 2016, [Online; accessed 10-May-2021].
[9] T. Diamantopoulos, G. Karagiannopoulos and A. L. Symeonidis, CodeCatch: Extracting source code snippets from online sources, in Proceedings of the 6th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering, 2018, 21–27.
[10] B. Dorn, A. Stankiewicz and C. Roggi, Lost while searching: Difficulties in information seeking among end-user programmers, Proceedings of the American Society for Information Science and Technology, 50 (2013), 1–10. doi: 10.1002/meet.14505001059.
[11] I. Drosos, T. Barik, P. J. Guo, R. DeLine and S. Gulwani, Wrex: A unified programming-by-example interaction for synthesizing readable code for data scientists, in Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, 1–12. doi: 10.1145/3313831.3376442.
[12] T. Erickson, M. Wilkerson, W. Finzer and F. Reichsman, Data moves, Technology Innovations in Statistics Education, 12 (2019). doi: 10.5070/T5121038001.
[13] H. Fangohr, A comparison of C, MATLAB, and Python as teaching languages in engineering, in International Conference on Computational Science, Springer, 2004, 1210–1217. doi: 10.1007/978-3-540-25944-2_157.
[14] K. A. T. Folland, viSQLizer: An Interactive Visualizer for Learning SQL, Master's thesis, Norwegian University of Science and Technology, 2016.
[15] G. Gao, F. Voichick, M. Ichinco and C. Kelleher, Exploring programmers' API learning processes: Collecting web resources as external memory, in 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), IEEE, 2020, 1–10. doi: 10.1109/VL/HCC50065.2020.9127274.
[16] M. Ichinco, W. Y. Hnin and C. L. Kelleher, Suggesting API usage to novice programmers with the Example Guru, in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2017, 1105–1117. doi: 10.1145/3025453.3025827.
[17] M. Ichinco and C. Kelleher, The need for improved support for interacting with block examples, in 2017 IEEE Blocks and Beyond Workshop (B&B), IEEE, 2017, 69–70. doi: 10.1109/BLOCKS.2017.8120415.
[18] S. Kandel, J. Heer, C. Plaisant, J. Kennedy, F. Van Ham, N. H. Riche, C. Weaver, B. Lee, D. Brodbeck and P. Buono, Research directions in data wrangling: Visualizations and transformations for usable and credible data, Information Visualization, 10 (2011), 271–288. doi: 10.1177/1473871611415994.
[19] S. Kandel, A. Paepcke, J. Hellerstein and J. Heer, Wrangler: Interactive visual specification of data transformation scripts, in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2011, 3363–3372. doi: 10.1145/1978942.1979444.
[20] C. Kelleher and M. Ichinco, Towards a model of API learning, in 2019 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), IEEE, 2019, 163–168. doi: 10.1109/VLHCC.2019.8818850.
[21] R. Kimball, Data wrangling, Information Management, 18 (2008), 8.
[22] A. J. Ko and Y. Riche, The role of conceptual knowledge in API usability, in 2011 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), IEEE, 2011, 173–176. doi: 10.1109/VLHCC.2011.6070395.
[23] S. Krishnamurthi and K. Fisler, Data-centricity: A challenge and opportunity for computing education, Communications of the ACM, 63 (2020), 24–26. doi: 10.1145/3408056.
[24] S. Kross and P. J. Guo, Practitioners teaching data science in industry and academia: Expectations, workflows, and challenges, in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, 1–14. doi: 10.1145/3290605.3300493.
[25] W. McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, O'Reilly Media, Inc., 2012.
[26] J. C. Nesbit and O. O. Adesope, Learning with concept and knowledge maps: A meta-analysis, Review of Educational Research, 76 (2006), 413–448. doi: 10.3102/00346543076003413.
[27] H. Niu, I. Keivanloo and Y. Zou, Learning to rank code examples for code search engines, Empirical Software Engineering, 22 (2017), 259–291. doi: 10.1007/s10664-015-9421-5.
[28] A. M. Olney and S. D. Fleming, A cognitive load perspective on the design of blocks languages for data science, in 2019 IEEE Blocks and Beyond Workshop (B&B), IEEE, 2019, 95–97. doi: 10.1109/BB48857.2019.8941224.
[29] A. Paivio, Mental Representations: A Dual Coding Approach, Oxford University Press, 1990. doi: 10.1093/acprof:oso/9780195066661.001.0001.
[30] N. Paton, Automating data preparation: Can we? should we? must we?, in 21st International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, 2019, 1–5.
[31] D. Qiu, B. Li and H. Leung, Understanding the API usage in Java, Information and Software Technology, 73 (2016), 81–100. doi: 10.1016/j.infsof.2016.01.011.
[32] RStudio, RStudio Cheat Sheets, https://github.com/rstudio/cheatsheets, 2021, [Online; accessed 03-June-2021].
[33] D. Schuff, Data science for all: A university-wide course in data literacy, in Analytics and Data Science, Springer, 2018, 281–297. doi: 10.1007/978-3-319-58097-5_20.
[34] B. Shneiderman, Teaching programming: A spiral approach to syntax and semantics, Computers & Education, 1 (1977), 193–197. doi: 10.1016/0360-1315(77)90008-2.
[35] H. A. Simon, The structure of ill structured problems, Artificial Intelligence, 4 (1973), 181–201. doi: 10.1016/0004-3702(73)90011-8.
[36] S. Sosnovsky and T. Gavrilova, Development of educational ontology for C-programming, in XI-th International Conference, vol. 1, 2005, 127.
[37] L. Sundin and Q. Cutts, Introducing data wrangling using graphical subgoals – findings from an e-learning study, in Proceedings of the Eighth ACM Conference on Learning @ Scale, 2021, 267–270. doi: 10.1145/3430895.3460155.
[38] P. Teetor, R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics, O'Reilly Media, Inc., 2011.
[39] K. Thayer, S. E. Chasins and A. J. Ko, A theory of robust API knowledge, ACM Transactions on Computing Education (TOCE), 21 (2021), 1–32. doi: 10.1145/3444945.
[40] Tidyblocks.tech, TidyBlocks, https://github.com/tidyblocks/tidyblocks, 2021, [Online; accessed 21-Feb-2021].
[41] D. Weinberger, Everything is Miscellaneous: The Power of the New Digital Disorder, Macmillan, 2007.
[42] D. Weintrop and U. Wilensky, To block or not to block, that is the question: Students' perceptions of blocks-based programming, in Proceedings of the 14th International Conference on Interaction Design and Children, 2015, 199–208. doi: 10.1145/2771839.2771860.
[43] H. Wickham and G. Grolemund, R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, O'Reilly Media, Inc., 2016.
[44] X. Zhang and P. J. Guo, DS.js: Turn any webpage into an example-centric live programming environment for learning data science, in Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, 2017, 691–702. doi: 10.1145/3126594.3126663.
[45] Y. Zhu, L. M. Hernandez, P. Mueller, Y. Dong and M. R. Forman, Data acquisition and preprocessing in studies on humans: What is not taught in statistics classes?, The American Statistician, 67 (2013), 235–241. doi: 10.1080/00031305.2013.842498.
Figure 1. Kelleher & Ichinco's Collection and Organization of Information for Learning (COIL) model [20,15]
Figure 2. The platform is split into three parts. Part 1 introduces the user to an ontology of data wrangling operations. Part 2 introduces programming. Part 3 contains 18 programmatic data wrangling exercises
Figure 3. The sidebar menu and a Part 1 operation card under the two conditions
Figure 4. A snippet from the sidebar menu. Calculate is one of five top-level categories, while the next level is split by data structure (e.g. dataframes). In the thumbnail graphics (TG) condition, each leaf node has a thumbnail graphic
Figure 5. Three examples of thumbnail graphics, using color in different ways to convey operation semantics
Figure 6. Part 1 contains a series of exercises in which the user selects operations from the menu and drags them to the corresponding subgoals
Figure 7. Part 3 involves programming exercises. The user is guided by a list of subgoals, each of which has associated hints. The sidebar menu is used to look up API documentation (shown above). In reality, the menu, subgoals and documentation are all tabs within the same sidebar panel
Figure 8. Participants were asked to rate their experience with Excel, Python, and R on a five-point Likert scale (1 = Not at all to 5 = Advanced). The distributions among people who started and completed each part are illustrated; they provide no visual evidence that more experienced participants are more likely to persevere
Figure 9. The distributions of the number of tooltip events per person, grouped by condition, in Part 1 (left) and Part 3 (right). Dashed lines indicate medians. In both cases, the TG group uses the tooltip significantly less often
Figure 10. Total number of menu clicks per participant in Part 3, grouped by condition. Dashed lines indicate medians
Figure 11. Total reading times of operation cards per person, grouped by condition. Dashed lines indicate medians. The TG group is quicker on average, but the difference is not significant
Figure 12. Total time on task per person for Part 1 (left) and Part 3 (right), grouped by condition. Dashed lines indicate medians. The median differences are negligible in both parts
Figure 13. Number of incorrect attempts per person for Part 1 (left) and Part 3 (right), grouped by condition. Dashed lines indicate medians. In both parts, the TG group makes fewer incorrect attempts, but the difference is not significant
Figure 14. Responses to the evaluation survey item asking participants how helpful they found the thumbnail graphics and tooltips (1 = Not at all, 5 = Very much). The survey was given after both Part 1 (left, N = 187) and Part 3 (right, N = 115)
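Several of the captions above report a per-person count of logged events, grouped by condition, with medians marked. As a minimal sketch of that kind of summary, the Python below counts events per participant and takes the median within each condition. The roster, the event log format, the event names and the label `control` for the group without thumbnail graphics are all hypothetical and chosen only for illustration; the sketch does not reproduce the platform's actual logging or the significance tests used in the study.

```python
from collections import Counter
from statistics import median

# Hypothetical roster: participant id -> assigned condition.
# "TG" = with thumbnail graphics; "control" is an assumed label for the group without.
roster = {"p1": "TG", "p2": "TG", "p3": "control", "p4": "control"}

# Hypothetical activity log: (participant id, event type).
events = [
    ("p1", "tooltip"),
    ("p3", "tooltip"),
    ("p3", "tooltip"),
    ("p4", "tooltip"),
    ("p4", "menu_click"),
    ("p3", "tooltip"),
]


def per_person_counts(roster, events, event_type):
    """Count events of one type per participant, keeping zero-event participants."""
    counts = Counter(pid for pid, kind in events if kind == event_type)
    return {pid: counts.get(pid, 0) for pid in roster}


def median_by_condition(roster, counts):
    """Group per-person counts by condition and return each group's median."""
    grouped = {}
    for pid, condition in roster.items():
        grouped.setdefault(condition, []).append(counts[pid])
    return {condition: median(values) for condition, values in grouped.items()}


tooltip_counts = per_person_counts(roster, events, "tooltip")
medians = median_by_condition(roster, tooltip_counts)
print(medians)
```

Building the counts from the roster rather than from the log alone keeps participants who never opened a tooltip in the distribution; dropping them would inflate the group medians.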