NLP-Based Approach to Semantic Classification of Heterogeneous Transportation Asset Data Terminology
Publication: Journal of Computing in Civil Engineering
Volume 31, Issue 6
Abstract
Inconsistent data terminology poses major challenges to integrating transportation project data from distinct sources. Differences in the meaning of data elements can lead to miscommunication between data senders and receivers. Semantic relations between terms in digital dictionaries, such as ontologies, can make the semantics of a data element transparent and unambiguous to computer systems. However, because effective automated methods are lacking, identifying these relations is labor intensive and time consuming. This paper presents a novel integrated methodology that leverages multiple computational techniques to extract heterogeneous American-English data terms used by different highway agencies, together with their semantic relations, from design manuals and other technical specifications. The proposed method uses natural language processing (NLP) to detect data elements in text documents and machine learning to determine the semantic relatedness among terms from their occurrence statistics in a corpus. The study also develops an algorithm that classifies semantically related terms into three lexical groups: synonymy, hyponymy, and meronymy. The key merit of this technique is that the detection of semantic relations uses only linguistic information in texts and does not depend on existing hand-coded semantic resources. In a case study, the proposed method was applied to a 16-million-word corpus of roadway design manuals to extract and classify roadway data items. The developed classifier was evaluated against a human-encoded test set, and the results show an overall precision of 92.76% and recall of 81.02%.
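The idea of estimating semantic relatedness from occurrence statistics can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple count-based co-occurrence model with cosine similarity as a stand-in for the machine learning step, and the function names, window size, and toy sentences are all hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each token to a Counter of the tokens seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Invented toy sentences (assumption): not taken from the paper's corpus.
corpus = [
    "the shoulder width shall be measured from the edge of the roadway".split(),
    "the verge width shall be measured from the edge of the roadway".split(),
    "the guardrail post spacing depends on the design speed".split(),
]

vectors = cooccurrence_vectors(corpus)

# Terms that occur in the same contexts receive a higher relatedness score.
sim_related = cosine(vectors["shoulder"], vectors["verge"])
sim_unrelated = cosine(vectors["shoulder"], vectors["guardrail"])
print(sim_related, sim_unrelated)
```

In this toy corpus, "shoulder" and "verge" appear in identical contexts and therefore score higher than "shoulder" and "guardrail". A score of this kind only indicates that two terms are related; separating synonymy, hyponymy, and meronymy requires a further classification step, which this sketch does not attempt.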
Acknowledgments
This research was funded by the National Science Foundation (NSF) through Award NSF-CIS 420-60-83. The authors gratefully acknowledge NSF’s support. Any opinions, findings, conclusions, and recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of NSF.
Copyright
©2017 American Society of Civil Engineers.
History
Received: May 22, 2016
Accepted: Apr 7, 2017
Published online: Jul 29, 2017
Published in print: Nov 1, 2017
Discussion open until: Dec 29, 2017